CRITERION-REFERENCED TESTING:
A DISCUSSION OF THEORY AND PRACTICE IN THE ARMY


INTRODUCTION 




This report is an interim document dealing with development of a
Criterion-Referenced Test (CRT) Construction Manual. The major objectives
of the study were the development of an easy-to-use, "how-to-do-it"
manual to assist Army test developers in the construction of CRTs, and the
identification of needed research to help achieve a more consistent,
unified criterion-referenced test model.

To accomplish these objectives, the study surveyed the literature on
criterion-referenced testing to provide an information base for
development of the CRT Construction Manual; visited selected Army
posts to review the present status of criterion-referenced test construction
and application in the Army; prepared a draft CRT construction manual;
conducted a trial application of the draft manual; and revised the CRT
construction manual. The manual for developing criterion-referenced tests
has been published as an ARI Special Publication: Guidebook for Developing
Criterion-Referenced Tests.


Part 1 of this report reviews the technical and theoretical literature
in criterion-referenced testing. This review is a serious discussion of the
state of the art in criterion-referenced testing, designed for the
academically oriented reader. The review discusses questions of CRT reliability
and validity in both practical and theoretical areas, different methods of
CRT construction, simulation fidelity (i.e., the extent to which CRTs can
and should mirror real-world performance conditions), the use of CRTs in
mastery learning contexts, test development and item sampling,
diagnostic uses of CRTs, the establishment of passing scores, and uses of CRTs in
public education and military contexts.

Part  2  describes  a  survey  of  Army  CRT  applications  at  a  number  of  Army 
installations.  Results  of  the  survey  are  indicated  through  an  analysis  of 
quantitative  data  collected  during  interviews  and  through  a  discussion  of 
qualitative  comments  received,  problems  observed,  and  areas  where  changes  may 
prove beneficial to the Army.

Appendices  A,  B,  and  C  provide,  respectively,  the  Interview  Protocol  used 
during  the  Army  CRT  survey;  a  summary  of  types  of  individuals  interviewed 
at  each  Army  installation  surveyed;  and  quantitative  data  gathered  at  each 
Army  post. 
PART 1--REVIEW OF TECHNICAL AND THEORETICAL LITERATURE

Criterion-referenced testing (CRT) has been widely discussed since the
term was popularized by Robert Glaser in 1963. In CRT, questions
involving comparisons among individuals are largely irrelevant. CRT
information is usually used to evaluate the student's mastery of instructional
objectives, or to approximately locate him for future instruction (Glaser
and Nitko, 1971). A CRT has been defined variously in the literature; in
fact, definitions vary so widely that a given test may be classified as
either a CRT or a norm-referenced test (NRT) according to the particular
definition used. Glaser and Nitko (1971) propose a flexible definition:

"A CRT is one that is deliberately constructed so as to
yield measurements that are directly interpretable in
terms of specified performance standards.... The
performance standards are usually specified by defining
some domain of tasks that the student should perform.
Representative samples of tasks from this domain are
organized into a test. Measurements are taken and are
used to make a statement about the performance of each
individual relative to that domain."

Common  to  all  definitions  is  the  notion  that  a  well-defined  content  domain 
and  the  development  of  procedures  for  generating  appropriate  samples  of 
test items are important. Lyons (1972) argues for the use of
criterion-referenced measurement as a vital part of training quality control:

"...quality  control  requires  absolute  rather  than  relative 
criteria.  Scores  and  grades  must  reflect  how  many  course 
objectives have been mastered rather than how a student
compares  with  other  students." 

For  the  purposes  of  this  review,  a  CRT  will  be  defined  as  a  test  where 
the  score  of  an  individual  is  interpreted  against  an  external  standard 
(i.e., a standard other than the distribution of scores of other testees).
Further,  CRTs  are  tests  whose  items  are  operational  definitions  of  behavioral 
objectives. 

The contemporary interest in mastery learning has led to a growing
interest in CRT. CRTs can be used to serve two purposes:

1. They can be used to provide specific information about the
   performance levels of individuals on instructional objectives.
   This information can be used to support a decision as to
   "mastery" of a particular objective (Block, 1971).

2. They can be used to evaluate the effectiveness of instruction.
   NRTs given at the end of a course are less useful for making
   evaluative decisions of the effectiveness of instruction
   because they are not derived from the particular task objectives.
   CRT is, however, useful for the evaluation of instruction
   because of the specificity of the results to the task objectives
   (Lord; Cronbach, 1963; Shoemaker, 1970a, 1970b; Hambleton,
   Rovinelli, and Gorth, 1971).

Popham points out a basic concern with the instrument itself:

"We have not yet made an acceptable effort to delineate
the defining dimensions of performance tests, in terms
of their content, objectives, post-test nature,
background information level, etc. Almost all of the recently
developed performance tests have been devised more or less
on the basis of experience and instruction."

Ebel poses a series of arguments against the use of CRT in
education. Ebel points out with some justification that CRT measures do
not tell us all we need to know about educational achievement, pointing
out that CRT measures are not efficient at discovering relative strengths
and deficiencies. This is true and is an excellent case for combining
CRT with NRT in cases where both relative and absolute information must
be gathered. Ebel also raises an objection shared by many practicing
educators to the whole "systems" approach to educational development.
That is, objectives specific enough to support the generation of CRT are
more likely to suppress than to stimulate "good teaching". Ebel leaves
us, however, without a metric capable of defining "good teaching" and with the
untenable assumption that "good teaching" is the rule. Finally, Ebel
confuses the concept of mastery of material with the practice of using
percentile grades as pass-fail measures. Ebel does not address the notion
that CRTs as currently constructed are the result of the application of a
carefully thought out analysis and development system.


RELIABILITY  AND  VALIDITY 

As Glaser and Nitko point out, the appropriate technique for
an empirical estimation of CRT reliability is not clear. Popham and Husek
suggest the traditional NRT estimates of internal consistency and
stability are not often appropriate because of their dependence on
test score variability. CRTs typically are interpreted in a [...]
fashion; hence, variability is drastically reduced. CRTs may be
consistent and stable, yet estimates of indexes that are dependent on
variability may not reflect this. This section will critically review a
number of studies which have addressed the question of reliability. The
question of validity of CRTs is inextricably mingled with this
issue and also presents many facets of opinion and theory. [...]
positions concerning reliability and validity will be discussed.
Cox and Vargas compared the results obtained from two item
analysis procedures using both pre-test and post-test scores; a Difference
Index (DI) was obtained in two ways. A post-test minus pre-test DI was
obtained by subtracting the percentage of students who passed an item on
the pre-test from the percentage who passed on the post-test. Also a DI
was obtained in the more conventional manner. After post-test, the
distribution of scores was divided into the upper third and the lower
third, then the percentage of students in the lower third who passed the
item was subtracted from the percentage of students in the upper third who
passed it. The Spearman rhos obtained between the two DIs were of a
moderate order. The authors concluded that their DI differed sufficiently
from the traditional method to warrant its use with CRTs. Hambleton and
Gorth (1971) replicated the work of Cox and Vargas (1966) and found that
the choice of statistic does indeed have a significant effect on the
selection of test items. The change in item difficulty from pre- to
post-test seems particularly attractive where two test administrations are
possible. Unfortunately, however, this method uses statistical procedures
dependent on score variability which are questionable for CRT (Popham and
Husek, 1969; Randall, 1972), particularly if it is to be employed for item
selection (Oakland, 1972).
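
The two ways of obtaining a difference index can be sketched in a few lines. This is only an illustration of the arithmetic described above, not Cox and Vargas's own procedure; the function names and the 0/1 coding of item responses are assumptions made for the example.

```python
from typing import Sequence


def pre_post_di(pre: Sequence[int], post: Sequence[int]) -> float:
    """Post-test minus pre-test DI for one item, in percentage points.

    `pre` and `post` hold 1 (passed the item) or 0 (failed) per student.
    """
    pct_pre = 100.0 * sum(pre) / len(pre)
    pct_post = 100.0 * sum(post) / len(post)
    return pct_post - pct_pre


def upper_lower_di(item: Sequence[int], totals: Sequence[float]) -> float:
    """Conventional DI: percent passing the item in the upper third of
    post-test total scores minus percent passing in the lower third."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    k = len(order) // 3
    if k == 0:
        raise ValueError("need at least three examinees")
    lower, upper = order[:k], order[-k:]

    def pct_passing(indices):
        return 100.0 * sum(item[i] for i in indices) / len(indices)

    return pct_passing(upper) - pct_passing(lower)


# Hypothetical data for six students on a single item.
pre_item = [0, 0, 0, 1, 0, 0]
post_item = [1, 1, 0, 1, 1, 1]
post_totals = [9, 8, 3, 10, 7, 5]

print(round(pre_post_di(pre_item, post_item), 2))
print(upper_lower_di(post_item, post_totals))
```

The first index needs two administrations but no spread in post-test totals; the second needs only the post-test but depends on there being enough score variability to form distinct upper and lower groups.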

Livingston (1972a) acknowledges Popham and Husek's comment that
"the typical indexes of internal consistency are not appropriate for
criterion-referenced tests". Nevertheless, Livingston feels that the
classical theory of true and error scores can be used in determining CRT
reliability. Livingston points out that "when we use criterion-referenced
measures we want to know how far...[a] score deviates from a fixed
standard." In Livingston's model, each concept based on deviations from a
mean score is replaced by a corresponding concept based on deviations from
the criterion score. In this view, criterion-referenced reliability can be
interpreted as a ratio of mean squared deviations from the criterion score
(true scores relative to observed scores).

If  this  view  is  accepted,  a  number  of  useful  relationships  are  provided; 
for  instance,  the  further  a  mean  score  is  from  the  criterion  score,  the 
greater  the  criterion-referenced  reliability  of  the  test  for  that  particular 
group.  In  effect,  moving  the  mean  score  away  from  the  criterion  score  has 
the  same  effect  on  criterion-referenced  reliability  that  increasing  the 
variance  of  true  scores  has  on  norm-referenced  reliability.  In  other  words, 
errors  of  misclassification  of  the  false  negative  variety  can  be  minimized 
by  accepting  as  true  masters  the  group  that  comfortably  exceeds  the  required 
criterion  level.  Another  point  is  that  if  we  accept  Livingston's  model, 
then  the  criterion-referenced  correlation  between  two  tests  depends  on  the 
difficulty level of the tests for the particular group involved. Two tests
can have a high correlation only if each is of similar difficulty for the
group of students. This provides an effective limitation for the
computation of inter-item correlations as it is often difficult to ensure equal
difficulty levels, which must fluctuate with the group being tested.
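
The dependence of the coefficient on the distance between the group mean and the criterion score can be illustrated numerically. The sketch below uses the form in which Livingston's coefficient is commonly written, k^2 = (r * var + (mean - C)^2) / (var + (mean - C)^2), where r is the conventional norm-referenced reliability; this formula is not given in the text above, and the function and parameter names are assumptions made for the example.

```python
def livingston_k2(norm_reliability: float, score_variance: float,
                  mean_score: float, criterion: float) -> float:
    """Criterion-referenced reliability in the commonly cited Livingston form.

    norm_reliability : conventional (norm-referenced) reliability estimate
    score_variance   : variance of observed scores in the group
    mean_score       : group mean
    criterion        : fixed criterion (cutting) score
    """
    squared_distance = (mean_score - criterion) ** 2
    return ((norm_reliability * score_variance + squared_distance)
            / (score_variance + squared_distance))


# Same test statistics, with the group mean moved away from a criterion of 8.
for mean in (8, 10, 12):
    print(mean, round(livingston_k2(0.6, 4.0, mean, 8), 2))
```

When the mean equals the criterion the coefficient reduces to the conventional reliability; as the mean moves away it rises toward 1 even though the standard error of measurement is unchanged, which is the property the critics discussed next take issue with.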

Regarding Livingston's (1972a) proposal that the psychometric theory of
true  and  error  scores  could  be  adapted  to  CRT,  Oakland  (1972)  commented 
that  the  procedures  seemed  viable  but  that  the  conditions  under  which  they 
could  be  used  were  overly  restrictive. 

Harris (1972) objects to Livingston's (1972a) application of classical
psychometric theory to CRT, pointing out that whether Livingston's coefficient
or a traditional one is applied, the standard error of measurement remains
the same. The fact that Livingston's coefficient is usually the larger does
not mean a more dependable determination of whether or not a true score
falls above or below the criterion score. As a rebuttal, Livingston (1972b)
indicates that Harris overlooked the point that reliability is not a
property of a single score but of a group of scores. Livingston also points out
that the larger criterion-referenced reliability does imply a more
dependable overall determination, when this decision is to be made for all
individual scores in the distribution.

Meredith and Sabers also take issue with Livingston's concept
of CRT reliability estimation as variability around the criterion score,
pointing out that CRT is concerned primarily with the accuracy of the
pass-fail decision and is relatively unconcerned with a person's attainment
above or below the criterion level.

Roudabush and Green present an analysis of false positives and
false negatives to derive reliability estimates. These authors presented
several methods for arriving at reliability estimates for CRT. The first
involves ordering items into a hierarchical order of increasing difficulty.
Roudabush and Green propose that error of measurement would be demonstrated
if a student failed an easier item while passing a series of more difficult
items. Oakland (1972) points out that it is exceedingly difficult to
establish the needed hierarchical order. This objection has been raised since
Guttman (1944) first proposed the technique of hierarchical ordering. Roudabush
and Green propose a second technique utilizing point-biserial correlation
between parallel tests. Their results with this method were far from
encouraging. In addition, there is great difficulty inherent in the development of
parallel tests. The third method involves the use of regression equations
to predict item criterion scores but has not yet been fully explored.

In a divergent work, Hambleton and Novick (1971) propose regarding CRT
reliability as the consistency of decision-making across parallel forms of
the CRT or across repeated measures. They view validity as the accuracy of
decision-making. This view departs from the classic psychometric view of
reliability and validity, and properly so, as the severely restricted variance
encountered with CRT will cause correlationally-based estimates of reliability
and validity to be artificially low. Hambleton and Novick view a
decision-theoretic metric such as a "loss function" as being more appropriate for use
in CRTs. This metric must serve to describe whether an individual's true score
is above or below a cutting score. The concept differs markedly from
Livingston's (1972a) notion in which the criterion is regarded as the true
score.
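
A decision-oriented view of reliability and validity can be made concrete with a small sketch. It is not Hambleton and Novick's formulation; it assumes a simple cutting score, treats reliability as the proportion of examinees classified the same way on two forms, and treats validity through an average threshold loss against an external mastery classification. The function names, the cost values, and the external classification are all hypothetical.

```python
from typing import Sequence


def classify(scores: Sequence[float], cut: float) -> list:
    """True means the examinee is classified as a master (score >= cut)."""
    return [s >= cut for s in scores]


def decision_consistency(form_a: Sequence[float], form_b: Sequence[float],
                         cut: float) -> float:
    """Proportion of examinees given the same decision on both forms."""
    a, b = classify(form_a, cut), classify(form_b, cut)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_threshold_loss(decisions: Sequence[bool], is_master: Sequence[bool],
                        cost_false_pos: float = 1.0,
                        cost_false_neg: float = 1.0) -> float:
    """Average loss: a cost is charged only when the decision is wrong."""
    loss = 0.0
    for decided, actual in zip(decisions, is_master):
        if decided and not actual:
            loss += cost_false_pos   # passed a non-master
        elif actual and not decided:
            loss += cost_false_neg   # failed a master
    return loss / len(decisions)


# Hypothetical scores for five examinees on two parallel forms, cut score 8.
form_a = [9, 7, 8, 5, 10]
form_b = [8, 8, 6, 4, 9]
print(decision_consistency(form_a, form_b, cut=8))

# Hypothetical external classification standing in for true mastery status.
external = [True, True, True, False, False]
print(mean_threshold_loss(classify(form_a, 8), external, cost_false_pos=2.0))
```

Neither quantity depends on the spread of scores, only on which side of the cutting score each examinee falls.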


The importance of correct decision-making in CRT applications is also
recognized by Edmonston, Randall, and Oakland (1972) who present a CRT
reliability model aimed at supporting decisions made during formative
evaluation and maximizing the probability of learning an established set
of objectives. Criterion-referenced items are usually binary coded
pass-fail; therefore, summaries of group performance on two items of pre- and
post-test can be displayed in a 2 x 2 contingency table. Edmonston et al.
recommend utilizing the cell proportions to provide information about the
relationships between the variables represented by the table. They find that
a simple summation of the diagonal proportions, $\sum_a p_{aa}$, provides a
very useful measure of agreement between categories, where the subscript a
indexes the categories, so that $p_{aa}$ is the proportion in a diagonal cell
in which both classifications agree (pass-pass or fail-fail).
They also recommend a supplemental measure $\lambda_r$ (lambda), a variance-free
coefficient. Goodman and Kruskal (1954) define

$$\lambda_r = \frac{\sum_a p_{aa} - \tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}{1 - \tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}$$

where $p_{M\cdot}$ and $p_{\cdot M}$ are the modal class frequencies for each of the two
cross-classifications. $\lambda_r$ may be interpreted as the relative reduction in
the probability of error of classification when going from a no-information
situation to the other-method-known situation. Edmonston et al. feel the
reliability estimate most useful to CRT is the extent to which CRT results
fluctuate temporally. They feel that, minimally, CRT items should provide stable
estimates of knowledge of curriculum content; $\sum_a p_{aa}$ and $\lambda_r$ can be used to
provide estimates of this stability. They recommend that $\sum_a p_{aa}$ be used to
judge the re-test reliability of each item. However, when item re-test
reliability falls below an arbitrary criterion (Edmonston et al. recommend
[...]) and into a zone of decision, $\lambda_r$ is employed as a descriptive measure
of the amount of information gained by employing a second item (the re-test)
in making curriculum or placement decisions. If knowledge of the re-test
score provides additional information, the item is retained. However, there
is no current basis for determining the acceptable minimal reduction in
classification error.
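
The agreement proportion and the lambda coefficient described above can be computed directly from a test-retest table of counts. The sketch below is illustrative only. It assumes the Goodman and Kruskal reliability form, in which the modal class is the single category whose two marginal proportions have the largest sum; the function name and the example counts are hypothetical.

```python
from typing import Sequence, Tuple


def agreement_and_lambda(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (sum of diagonal proportions, lambda_r) for a square table.

    table[i][j] is the number of examinees in category i on the first
    administration and category j on the second (e.g. 0 = fail, 1 = pass).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p = [[count / n for count in row] for row in table]

    p_agree = sum(p[a][a] for a in range(k))             # diagonal sum
    row_marg = [sum(p[a]) for a in range(k)]             # first administration
    col_marg = [sum(p[r][a] for r in range(k)) for a in range(k)]

    # Assumption: modal class = category maximizing row + column marginal.
    half_modal = 0.5 * max(row_marg[a] + col_marg[a] for a in range(k))
    # Undefined (division by zero) if every examinee falls in one category.
    lam = (p_agree - half_modal) / (1.0 - half_modal)
    return p_agree, lam


# Hypothetical item: rows = test (fail, pass), columns = re-test (fail, pass).
counts = [[10, 5],
          [5, 80]]
p_agree, lam = agreement_and_lambda(counts)
print(round(p_agree, 2), round(lam, 3))
```

In this example 90 percent of examinees receive the same result twice, yet lambda is only about one third, because simply guessing the modal "pass" category would already be right 85 percent of the time.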

In  the  same  vein  as  Edmonston  et  al.,  Roudabush  (1973)  views  reliability 
as  referring  to  the  appropriateness  of  the  decisions  made  that  affect  the 
treatment of the examinee. Roudabush emphasizes "minimizing risk or cost
to examinee." The decision is whether to discontinue instruction,
remediate, or wash out.

As  is  the  case  with  NRT  development,  determination  of  validity  for 
CRT has seen less investigation than reliability. However, it seems logical
that content validity must be the paramount concern for CRT development.
According  to  Popham  and  Husek  (1969)  content  validity  is  determined  by  "a 
carefully  made  judgement,  based  on  the  test's  apparent  relevance  to  the 
behaviors  legitimately  inferable  from  those  delimited  by  the  criterion." 

McFann (1973) views the content validation of training as having two
major dimensions. The first dimension is the role of the human within the
general operating system. Generally, this is defined by means of task
analysis. The second dimension involves the skills and knowledge the
trainee brings with him to the course; the training content can then be
viewed as a residual of what must still be imparted to the trainee. The
decision of what to include in the training must also be tempered by
management orientation to cost and effectiveness. Finally, McFann feels that
decisions made on the units or procedures by which output is to be evaluated
have an influence on validation of training content. McFann views the
validation of training content as a dynamic, interactive process, whereby
training content is initially determined and then, on the basis of feedback
on student performance on the job, instructional content as well as
instruction method is modified to improve overall system effectiveness.

Edmonston, Randall, and Oakland (1972) hold content validation as
central to CRT development. CRT items are sampled theoretically from a
[...] item domain and must be representations of a specified behavioral
objective.

Hambleton and Novick (1971) propose a validity theory in which a new
test Y would serve as criterion. The qualifying score of the second test
need not correspond with the qualifying score of the predictor CRT. The
test Y these authors suggest might be derived from performance on the next
unit of instruction, or it may be a job-related performance criterion.
Although this appears to be a good idea, it seems that different conclusions
could be reached if test Y were a job-related criterion instead of performance
on the next unit of instruction. The fact that the conclusion might be
different could, however, yield an approximation of convergent and
divergent validity. Validation of a test determined by correlating it with
another test may, however, give a distinct overestimate of "validity". This
is particularly true in the case where the tasks on the two tests are
similar.

Edmonston et al. (1972) advocate a method of CRT validation which they
call the criterion-oriented approach, which includes both concurrent and
predictive validity. In order to obtain complete information about an item
and the objective it assesses, the relationship of a CRT to other measures
should be considered (i.e., ratings by teachers and training observers as
well as performance on suitable NRT measures). Edmonston et al. view these
as measures of concurrent validity, although these multiple indicators
could, if properly chosen, provide an estimate of construct validity. In
addressing the problems of predictive validation, Edmonston et al. concur
with Kennedy (1972) in proposing that tests of curriculum mastery which
represent higher order concepts taught within several curriculum units be
used as criteria against which unit test items would be assessed as to their
predictive power. In addition, unit test items which are more temporally
proximate should agree more strongly with Mastery Test items than items
sequenced earlier. This notion has been partially verified by Edmonston and
his co-workers. Final verification of this scheme of validity determination
requires factorially pure items and this may be a bit too much to ask of
item writers. Edmonston et al. advocate an approach to construct validity
initially put forth by Nunnally (1967). In Nunnally's view, the
measurement and validation of a construct involve the determination of an internal
network among a set of measures, and the consequent formation of a network
of probability statements. This notion is not too far from Cronbach and
Meehl's (1955) enunciation of the need for a "nomological network" with
which to validate a construct. Edmonston et al. indicate that the
"specification of a hierarchy of learning sets among items would seem to be the
ultimate goal of construct validation procedures, enabling the development
of internal and cross structures between items and the consequent understanding
of the inter-relationships of all curriculum areas". This concept would be
difficult to implement, as the construction of learning sets is not an easy
procedure. Also, difficulty can be expected in attempting the establishment
of a network of relationships sufficient to completely define a construct.

In Roudabush's (1973) view of validity, CRT items are designed to sample
as  purely  as  possible  the  specified  domain  of  behavior,  then  tried  out  to 
determine  primarily  if  the  items  are  sensitive  to  instruction.  A  2x2  contin¬ 
gency  table  containing  post-test  and  pre-test  outcomes  is  the  basis  for 
analysis : 


                         Post-test
                     Failed    Passed
Pre-test   Failed      f1        f2
           Passed      f3        f4

f1: failed both pre- and post-test
f2: failed pre-, passed post-test
f3: passed pre-, failed post-test
f4: passed both pre- and post-test

Marks and Noll (1967) assume f3 is due to guessing and derive a sensitivity
index (s) that is simply the proportion of cases that missed the item on the
pre-test and passed it on the post-test, with a correction for guessing.

Roudabush (1973), however, found that to derive a "reasonably reliable"
value for the index there should be [...] cases who missed the item at pre-test
(f1 + f2), while if the f4 cell is high the index will have little value (neither
will the item). This index ranges from 1.00 to 0.00 but may go below 0.0
if the item is miskeyed. A problem here may be ensuring that different but parallel
items are used for pre- and post-tests. This problem is a practical one,
but is particularly acute when complex content domains are contemplated.

These various treatments of CRT validity all exhibit difficulties
that often might prove insurmountable to a test constructor dealing with
"real world" problems. Content validity, however, is extremely important
in CRT and can be reasonably ensured by careful attention to objective
development. Construct validity will probably prove elusive if only due
to the complexity of operations and measures required to demonstrate this
form of validity. Predictive validity appears practicable in many situations.


CONSTRUCTION METHODOLOGY

NRTs are primarily designed to measure individual differences. The
meaning which can be attached to any particular score depends on a
comparison of that score to a relevant norm distribution. A norm-referenced test
is constructed specifically to maximize the variability of test scores since
such a test is more likely to produce fewer errors in ordering the individuals
on the measured ability. Since NRTs are often used for selection and
classification purposes, it follows that minimizing the number of order errors is
extremely important.

NRTs are constructed using traditional item analysis procedures. It
is partly because of this that the test scores cannot be interpreted
relative to some well-defined content domain, since items are normally selected
to produce tests with desired statistical properties (e.g., difficulty levels
around .50) rather than to be representative of a content domain. Likewise,
a wide range of item difficulty does not occur because of resulting variance
restriction. Item homogeneity is also much sought in development of NRTs.
The ultimate purpose is to spread out individuals by maximizing the
discriminating power of each item. The emphasis is on comparing an individual's
response with the responses of others. There is no interest in absolute
measurement of individual skills as in CRTs, only relative comparison.

Although conceptually allied to the construction of NRTs, item analysis
is an important tool in assembling a test from an item pool and therefore has
application to the construction of certain CRTs. Although content validity
is an important characteristic for an item in a CRT, there are other
important considerations having to do with the sensitivity and discriminating
power of an item. These features are important when evaluating instruction
and in ensuring the correct decision regarding an individual's progress
through instruction.

In CRT development, the item difficulty index is useful for selecting
"good" items. However, item difficulty is used differently than in NRT.
If the content domain is carefully specified, test items written to measure
accomplishment of the objectives should also be carefully specified and
closely associated with the objectives. Therefore, all of the items
associated with the same objective should be answered correctly by about
the same proportion of examinees in a group. Items which differ greatly
should be carefully examined to determine if they coincide with the intent
of the objectives.
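
This use of item difficulty can be sketched as a simple screening step: compute the proportion correct for every item, then flag items whose difficulty departs from the average for their objective. The tolerance of .25 and all names below are hypothetical choices for illustration; the text does not prescribe a particular threshold.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def item_difficulties(responses: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Proportion of examinees answering each item correctly (1 = correct)."""
    return {item: sum(r) / len(r) for item, r in responses.items()}


def flag_outlying_items(responses: Dict[str, Sequence[int]],
                        objective_of: Dict[str, str],
                        tolerance: float = 0.25) -> List[str]:
    """Items whose difficulty differs from their objective's mean difficulty
    by more than `tolerance` (an arbitrary, hypothetical cut-off)."""
    p = item_difficulties(responses)
    items_by_objective = defaultdict(list)
    for item, objective in objective_of.items():
        items_by_objective[objective].append(item)

    flagged = []
    for items in items_by_objective.values():
        mean_p = sum(p[i] for i in items) / len(items)
        flagged.extend(i for i in items if abs(p[i] - mean_p) > tolerance)
    return sorted(flagged)


# Hypothetical responses from five examinees to five items on two objectives.
responses = {
    "A1": [1, 1, 1, 1, 0],
    "A2": [1, 1, 1, 0, 1],
    "A3": [0, 0, 1, 0, 0],   # much harder than its sibling items
    "B1": [1, 0, 1, 1, 1],
    "B2": [1, 1, 0, 1, 1],
}
objective_of = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
print(flag_outlying_items(responses, objective_of))
```

A flagged item is a candidate for review against the intent of its objective, not an automatic rejection.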

Similarly, item discrimination indexes can be useful for CRT development.
Negative discrimination indexes warn that CRT items need modification,
or that the instructional process is at fault. A negative index would be
indicative of a high proportion of "false negatives"; conversely a positive
discrimination index is useful for diagnosing shortcomings in the
instructional program.

An attempt to use item analysis techniques to develop test evaluation
indexes was undertaken by Ivens (1970). Ivens defines reliability indexes
based on the concept of within-subject equivalence of scores. Item reliability
is defined as the proportion of subjects whose item scores are the same on
the post-test and either a re-test or parallel form. Score reliability is
then defined as the average item reliability. Unfortunately the need for
re-test or for two forms (parallel) would seem to reduce the usefulness of
this scheme except in very special situations.
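
The two indexes as defined in this paragraph reduce to simple proportions, as the following sketch shows. It follows the verbal definitions above rather than Ivens's own notation; the function names and the 0/1 response coding are assumptions.

```python
from typing import Dict, Sequence


def item_reliability(first: Sequence[int], second: Sequence[int]) -> float:
    """Proportion of subjects with the same item score on both occasions."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def score_reliability(post: Dict[str, Sequence[int]],
                      retest: Dict[str, Sequence[int]]) -> float:
    """Average of the item reliabilities over all items on the test."""
    reliabilities = [item_reliability(post[item], retest[item]) for item in post]
    return sum(reliabilities) / len(reliabilities)


# Hypothetical pass (1) / fail (0) scores for four subjects on two items.
post = {"item1": [1, 1, 0, 1], "item2": [1, 0, 0, 1]}
retest = {"item1": [1, 1, 0, 0], "item2": [1, 0, 0, 1]}

print(item_reliability(post["item1"], retest["item1"]))
print(score_reliability(post, retest))
```

Both values require a second administration or a parallel form, which is the practical limitation noted above.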

Rahmlow, Matthews and Jung (1970) suggest that the function of a
discrimination index in a CRT is primarily that of indicating the
homogeneity of the item with respect to the specific instructional objective
measured. These authors focus attention on a shift in item difficulty from
pre-instruction to post-instruction.

Helmstadter compared alternative indexes of item usefulness:

1. Item discrimination based on high and low groups on a
   post-instructional measure.

2. Shift in item difficulty from pre- to post-instruction.

3. Item discrimination based on pre- and post-test performance.

Shift in item difficulty from pre- to post-instruction produced results
significantly more similar to the pre-post discrimination index than did
the high-low group post-test discrimination index.

Helmstadter also sought to compare the traditional item discrimination
index applied to pre- and post-instruction with difficulty indexes derived
in the same fashion. His findings confirmed that caution should be observed
in the use of traditional item analysis procedures in CRT. In a similar
finding, Roudabush (1973) showed that use of traditional item statistics
would have resulted in some objectives being over-represented while others
would be represented by no items.

Ozenne has developed an elaborate model of subject response
which he uses to derive an index of sensitivity. In this formulation the
sensitivity of a group of comparable measures given to a sample of subjects
before  and  after  instruction  is  the  variance  due  to  the  instructional 
effect  divided  by  the  sum  of  the  variance  due  to  the  instructional  effect 
and  error  variance.  The  index  was,  however,  developed  for  a  severely 
restricted  sample  to  allow  an  analysis  of  variance  treatment.  Further 
development  is  indicated  before  the  technique  has  general  usefulness  for 
sensitivity  measurement  or  item  selection. 
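
Read literally, the sensitivity index described above is a variance ratio. Writing $\sigma^2_I$ for the variance due to the instructional effect and $\sigma^2_e$ for the error variance (both symbols are assumptions introduced here, not the author's notation), it can be sketched as:

```latex
S = \frac{\sigma^2_I}{\sigma^2_I + \sigma^2_e}, \qquad 0 \le S \le 1
```

On this reading S approaches 1 when nearly all observed variation is attributable to instruction and approaches 0 when error variance dominates.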

New  procedures  have  been  developed  for  item  analysis  for  specific 
cases of CRTs but evidence as to their generalizability is lacking. If
item analytic procedures are to be used in evaluating CRTs, then it must
be known what sort of score is produced by that item. The usual score is
a  pass-fail  dichotomy.  A  CRT  item  can  result  in  two  types  of  incorrect 
decisions. Roudabush and Green (1972) refer to these errors as "false
positives"  and  "false  negatives".  In  this  view,  reliability  is  concerned 
with  the  CRT's  ability  to  consistently  make  the  same  decision.  Consequently, 
validity  becomes  the  ability  of  the  CRT  to  make  the  "right"  decision,  i.e., 
avoiding  false  negatives  and  false  positives.  The  adequacy  of  a  CRT  in  these 
authors'  view  is  determined  by  its  ability  to  discriminate  consistently  and 
appropriately  over  a  large  number  of  items. 

Carver (1970) proposed two procedures to assess reliability of a CRT
item.  For  a  single  form  he  suggests  comparing  the  percentage  meeting 
criterion  level  in  one  group  to  the  same  percentage  in  another  "similar" 
group;  for  homogeneous  sets  he  recommends  using  one  group  and  comparing  the 
percentages  identified  as  meeting  criterion  on  all  items.  Meredith  and  Sabers 
(1972) point out, however, that it must be determined how two CRT items,
whether  identical  or  parallel,  identify  the  same  individual  with  regard  to 
his  attainment  of  criterion  level.  With  regard  to  item  analysis  procedures, 
if  a  CRT  item  is  administered  before  and  after  instruction,  and  it  does  not 
discriminate,  there  are  alternatives  to  labeling  it  unreliable.  A  non¬ 
discriminating  item  may  simply  be  an  invalid  measure  of  the  objectives  or  it 
may  indicate  that  the  instruction  itself  is  inadequate  or  unnecessary. 

Meredith and Sabers suggest the use of a matrix consisting of the pass-fail
decisions of two CRTs. By defining the two CRT items as being the same
measures we can examine test/re-test reliability, but without time
intervening between the measures, the reliability is of the concurrent or
internal consistency variety. In addition, undefined problems exist with
acceptably defining two CRTs as the same. Various other indexes are possible
but a great weight is placed upon carefully defining relationships between
measures a priori. Considerable confusion is evident in the use of "same" and
parallel forms without formal definitions. Similarly it is stated that if
one CRT item is a "criterion measure", then the validity of the other CRT
can be found. By definition, both are criterion measures and if a "criterion
measure" is external to the instructional domain, then it is not a CRT item
in the same sense. Various coefficients are given but the difficulty in
definition mentioned above limits their usefulness.
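
The matrix of pass-fail decisions can be sketched as a two-by-two tally. The example below is illustrative only: it assumes each measure yields one pass-fail decision per subject, and it reports a simple proportion of agreement, which is one possible index and not necessarily one of the coefficients proposed by Meredith and Sabers. All names and data are hypothetical.

```python
from collections import Counter
from typing import Sequence


def decision_matrix(first: Sequence[bool], second: Sequence[bool]) -> dict:
    """Tally subjects by their (first, second) pass-fail decisions."""
    counts = Counter(zip(first, second))
    return {(a, b): counts[(a, b)] for a in (True, False) for b in (True, False)}


def proportion_agreement(first: Sequence[bool], second: Sequence[bool]) -> float:
    """Share of subjects receiving the same decision from both measures."""
    same = sum(1 for a, b in zip(first, second) if a == b)
    return same / len(first)


# Hypothetical decisions for five subjects (True = pass).
crt_a = [True, True, False, False, True]
crt_b = [True, False, False, False, True]

matrix = decision_matrix(crt_a, crt_b)
print(matrix)
print(proportion_agreement(crt_a, crt_b))
```

Whether the off-diagonal cells are read as unreliability or as false positives and false negatives depends entirely on how the relationship between the two measures was defined beforehand, which is the difficulty noted above.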


FIDELITY 


Frederiksen (1962) has proposed a hierarchical model for describing
levels  of  fidelity  in  performance  evaluation.  Frederiksen  has  identified 
six  categories: 

1. Solicit opinions. This category, the lowest level, may in fact
often miss the payoff questions (e.g., to what extent has the behavior
of trainees been modified as a function of the instructional process).

2. Administer attitude scales. This technique, although
psychometrically refined via the work of Thurstone, Likert, Guttman, and others,
assesses primarily a psychological concept (attitude) which can only be
presumed to be concomitant with performance.

3. Measure knowledge. This is the most commonly used method of
assessing achievement. This technique is usually considered adequate only
if the training objective is to produce knowledge or if highly defined,
fixed procedure tasks are involved.

4. Elicit related behavior. This approach is often used in situations
where practicality dictates observation of behavior thought to be logically
related to the criterion behavior.

5. Elicit "What Would I Do" behavior. This method involves
presentation of brief descriptions or scenarios of problem situations under simulated
predesigned conditions; the subject is required to indicate how he would
solve the problem if he were in the situation.

6. Elicit lifelike behavior. Assessment under conditions which approach
the realism of the real situation.

Measurement  at  any  of  the  six  levels  proposed  by  Frederiksen  possesses 
both  advantages  and  disadvantages.  An  optimal  solution  would  be  to  assess 
individual  performance  at  the  highest  possible  level  of  fidelity.  Unfortu¬ 
nately,  deriving  performance  data  may  involve  a  subjective  (rating)  technique 
for  a  specific  situation,  requiring  a  subjectivity  vs.  fidelity  tradeoff. 

In  order  to  minimize  subjectivity,  it  may  be  necessary  to  decrease  the  level 
of  fidelity  so  that  more  objective  measurements  (such  as  time  and  errors)  can 
be obtained. These measures can be conceptualized as surrogates that in some
sense embody real criteria but have the virtue of measurability (Rapp, Root,
and Sumner, 1970). An actual increase in overall criterion adequacy may result
from  a  gain  in  objectivity  which  may  compensate  for  a  corresponding  loss  in 
fidelity. 

The question of fidelity addresses the issue of how closely the test should
resemble the actual performance. Fidelity is not usually at issue in NRT and
has its primary application in criterion-referenced performance tests. There
are  trades  to  be  made  between  fidelity  and  cost.  A  more  salient  issue, 
however,  is  how  to  empirically  modify  face  fidelity  to  satisfy  needs  of  the 
testing  situation  while  retaining  the  essential  stimuli  and  demand  character¬ 
istics  of  the  real  performance  situation. 


Osborn (1970) addresses problems in finding efficient alternatives to
work sample tests. Osborn was concerned with developing a methodology that
would allow derivation of cheaper procedures that would preserve content
validity.  There  are  many  realistic  situations  where  job  sample  tests  are 
not  feasible,  and  job-knowledge  tests  are  not  relevant.  Obviously  the 
existence  of  intermediate  measures  would  be  a  great  boon  to  evaluating 
performance  in  this  situation.  However,  methods  for  developing  inter¬ 
mediate  or  "synthetic"  measures  are  lacking.  Osborn  gives  a  brief  outline 
of  a  method  for  developing  these  synthetic  measures.  Osborn  presents  a 
two way matrix defined by methods of testing terminal performance (simple
to complex) and component (enabling) behaviors. This matrix serves as a
decision-making aid by allowing the test constructor to choose the test
method most cost-effective for each behavior. The tradeoff that must be made
between  test  relevance,  related  diagnostic  performance  data,  and  ease  of 
administration  and  cost  is  obvious,  and  must  be  resolved  by  the  judgement  of 
the  test  constructor.  Osborn's  notions  are  intriguing  but  much  more  develop¬ 
ment  is  needed  before  a  workable  method  for  deriving  synthetic  performance 
tests  is  available. 

Vineberg and Taylor (1972) address a topic allied to the fidelity issue,
that is: to what extent can job knowledge tests be substituted for
performance tests? Practical considerations have often dictated the use of paper
and pencil job knowledge tests because they are simple and economical to
administer and easy to score. However, the use of paper and pencil tests to
provide indexes of individual performance is often considered to be poor
practice by testing "experts". HumRRO research under Work Unit UTILITY
compared the proficiency of Army men at different ability levels and with
different amounts of job experience. This work provided Vineberg and
Taylor with an opportunity to examine the relationship between job sample
test scores and job knowledge test scores in four U.S. Army jobs that
varied greatly in job type and task complexity. Vineberg and Taylor found
that job knowledge tests are valid for measuring proficiency in jobs where:
1) skill components are minimal, and 2) job knowledge tests are carefully
constructed to measure only that information that is directly relevant to
performing the job at hand. Given the high costs of obtaining performance
data, these findings suggest that job knowledge tests are appropriate where
skill requirements are determined by careful job analysis to be minimal.

In a similar work, Engel and Rehder (1970) compared peer ratings, a
job knowledge test, and a work-sample test. These workers found that while
the knowledge test was acceptably reliable, it lacked validity, and reading
ability tended to enter into performance. Peer ratings were judged to
have unacceptable validity. Ratings were also essentially uncorrelated
with the written test. The troubleshooting items on the written test
exhibited  a  moderate  but  useful  level  of  validity,  while  the  corrective- 
action  items  had  little  validity.  Finally,  Engel  and  Rehder  note  that  the 
work-sample  is  the  most  costly  method  and  is  difficult  to  administer,  while 
the  peer  ratings  and  written  tests  were  the  least  costly  and  were  easy  to 
administer. 


Osborn (1973) discusses an important topic related to both the validity
and  fidelity  of  a  CRT.  Osborn  points  out  that  task  outcomes  and  products 
are  used  to  assess  student  performance  while  measures  of  how  the  tasks  are 
done  (processes)  pertain  to  the  diagnosis  of  instructional  systems.  Time 
or  cost  factors  sometimes  preclude  the  use  of  product  measures,  thus  leaving 
process  measures  as  the  only  available  criteria.  There  are  cases  where  this 
focus on process is legitimate and useful but many where it is not. Osborn
developed  three  classes  of  tasks  to  illustrate  what  the  relative  roles  of 
product  and  process  measurement  should  be. 

1. Tasks where the product is the process.

2. Tasks in which the product always follows from the process.

3. Tasks in which the product may follow from the process.

Relatively  few  tasks  are  of  the  first  type.  Osborn  offers  gymnastic 
exercises  or  springboard  diving  as  examples.  More  tasks  are  of  the  second 
type,  i.e.,  fixed  procedure  tasks.  In  these  tasks,  if  the  process  is 
correctly  executed  the  product  follows.  A  great  many  tasks  are  of  the  third 
type  where  the  process  appears  to  have  been  correctly  carried  out  but  the 
product  was  not  attained.  Osborn  offers  two  reasons  why  this  can  happen: 
either 1) we were unable to specify fully the necessary and sufficient steps
in task performance, or 2) because we do not or cannot accurately measure
them. An example of aim-firing a rifle is given as an illustration that there
is no guarantee of acceptable marksmanship even if all procedures are followed.
In this case, process measurement would not adequately substitute for product
measurement.  For  tasks  of  the  first  two  types,  Osborn  concludes  that  it 
really  doesn't  matter  which  measure  is  used  to  assess  proficiency;  but  for 
tasks  of  the  third  type,  product  measurement  is  indicated.  Osborn,  however, 
discusses a number of type 3 tasks where product measurement is impractical
because  of  cost,  danger,  or  practicality.  In  these  cases  process  measures 
would  come  to  be  substituted  with  resulting  injury  to  the  validity  of  the 
measure.  Osborn  poses  a  salient  question  that  the  test  developer  must  answer: 
If  I  use  only  a  process  measure  to  test  a  man's  achievement  on  a  task,  how 
certain  can  I  be  from  this  process  score  that  he  would  also  be  able  to 
achieve  the  product  or  outcome  of  the  task?  Osborn  holds  that  where  the  degree 
of certainty is substantially less than that to be expected by errors of
measurement, the test developer should pause and reconsider ways in which time
and  resources  could  be  compromised  in  achieving  at  least  an  approximation  to 
product  measurement.  Osborn  concludes  by  noting:  The  accomplishment  of 
product  measurement  is  not  always  a  simple  matter;  but  it  is  a  demanding  and 
essential  goal  to  be  pursued  by  the  performance  test  developer  if  his  products 
are to be relevant to real world behavior. Swezey (1974) has also addressed
process versus product measurement, and assist versus non-interference methods
of scoring in CRT development. Swezey has recommended process measurement
in addition to, or instead of, product measurement when: diagnostic
information is desired, when additional scores are needed on a particular task, and
when  there  is  no  product  at  the  end  of  the  process. 
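
Osborn's conclusions about the three classes of tasks can be restated as a simple decision rule. The sketch below is only an illustration of the reasoning summarized above: the enumeration names, the function, and the returned wording are all hypothetical and are not part of Osborn's paper.

```python
from enum import Enum


class TaskType(Enum):
    PRODUCT_IS_PROCESS = 1       # e.g., gymnastic exercises, springboard diving
    PRODUCT_ALWAYS_FOLLOWS = 2   # fixed procedure tasks
    PRODUCT_MAY_FOLLOW = 3       # correct process does not guarantee the product


def recommended_measure(task_type: TaskType, product_measurable: bool = True) -> str:
    """Suggest a measure for assessing proficiency, following the summary above."""
    if task_type in (TaskType.PRODUCT_IS_PROCESS, TaskType.PRODUCT_ALWAYS_FOLLOWS):
        # For the first two types either measure serves equally well.
        return "either"
    if product_measurable:
        return "product"
    # Cost, danger, or practicality rules out product measurement.
    return "process (validity at risk)"


print(recommended_measure(TaskType.PRODUCT_MAY_FOLLOW))
print(recommended_measure(TaskType.PRODUCT_MAY_FOLLOW, product_measurable=False))
```

The last branch is the case Osborn warns about: a process score is substituted and the developer must ask how confidently it predicts the product.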


An issue which must be faced when constructing a complex CRT is the
bandwidth-fidelity problem (Cronbach and Gleser), i.e., the question
of whether to obtain precise information about a small number of
competencies or less precise information about a larger number. Hambleton and
Novick (1971)
conclude  that  the  problem  of  how  to  fix  the  length  of  each  sub-scale  to 
maximize  the  percentage  of  correct  decisions  on  the  basis  of  test  results 
has  yet  to  be  resolved  or  even  satisfactorily  defined. 

ISSUES  RELATED  TO  CRT  CONSTRUCTION 

Although construction methodology for NRT is well established and
highly specified, the construction of CRT has been much more of an art.
There have been, however, several attempts to formalize the construction
of CRT. Ebel (1962) describes the development of a criterion-referenced
test of knowledge of word meanings. Three steps were involved:

1. Specification of the universe to which generalization is desired.

2. A systematic plan for sampling from the universe.

3. A standardized method of item development.

These characteristics together serve to define the meaning of test scores.
To the extent that scores are reproducible on tests developed independently
under the same procedures, the scores may be said to have inherent meaning.
Flanagan (1962) indicates that a variant of Ebel's procedure was used in
Project TALENT. The tests used in the areas of spelling, vocabulary, and
reading were not based on specific objectives. They were, however, developed
by systematically sampling a relevant domain. Fremer and Anastasio (1969)
also put forth a method for systematically generating spelling items from
a specified domain.

Osburn (1968) notes two conditions as prerequisites for allowing
inferences to be made about a domain of knowledge from performance on a
collection of items:

1. All items that could possibly appear on a test should be specified
in advance.

2. The items in a particular test should be selected by random
sampling from the content universe.

It is rarely feasible to satisfy the first condition in any complete
fashion for complex behavior domains. However, the problem of testing all
items can be overcome at least in a highly specified content area by the
use of an item form (Hively, Patterson, and Page, 1968; Osburn, 1968).

The item form generally has the following characteristics (Osburn, 1968):

1. It generates items with a fixed syntactical structure.

2. It contains one or more variable elements.

3. It defines a class of item sentences by specifying the replacement
sets for the variable elements.

Shoemaker and Osburn describe a computer program capable of
generating both random and stratified random parallel tests from a well-
defined and rule-bound population. However, generalizing these results to
other domains has led to the finding that the difficulty of objectively
defining a test construction process is directly related to the complexity
of the behavior the test is designed to assess (Jackson, 1970). Where the
domain  is  easily  specified  as  in  spelling,  the  construction  process  is 
simplified. 
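
The idea of an item form, and of drawing random or stratified random parallel tests from the domain it defines, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the arithmetic templates, the replacement sets, and all function names are hypothetical, and the code is not a reconstruction of the Shoemaker and Osburn program.

```python
import itertools
import random


def expand_item_form(template: str, replacement_sets: dict) -> list[str]:
    """Generate every item sentence: a fixed syntactical structure with each
    variable element filled from its replacement set."""
    names = list(replacement_sets)
    combos = itertools.product(*(replacement_sets[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


def random_test(domain: list[str], n_items: int, rng: random.Random) -> list[str]:
    """Simple random sample of items (without replacement) from the domain."""
    return rng.sample(domain, n_items)


def stratified_random_test(strata: dict, per_stratum: int,
                           rng: random.Random) -> list[str]:
    """Draw the same number of items at random from each stratum."""
    test = []
    for name in sorted(strata):
        test.extend(rng.sample(strata[name], per_stratum))
    return test


# Two hypothetical item forms, each defining one stratum of the domain.
addition = expand_item_form("{a} + {b} = ?", {"a": [2, 3, 4], "b": [5, 6]})
subtraction = expand_item_form("{a} - {b} = ?", {"a": [7, 8, 9], "b": [1, 2]})
strata = {"addition": addition, "subtraction": subtraction}

rng = random.Random(0)  # seeded so the forms can be regenerated
form_a = stratified_random_test(strata, 2, rng)
form_b = stratified_random_test(strata, 2, rng)
print(form_a)
print(form_b)
```

The sketch works because the domain is fully enumerable from its rules; for complex behavior no comparable set of replacement rules is readily available, which is the limitation described above.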

It  appears  that  at  the  current  state-of-the-art,  it  is  difficult  to 
develop  the  objective  procedures  necessary  for  criterion-referenced 
measurement  of  complex  behavior  without  doing  violence  to  measurement 
objectives.  What  is  needed  for  complex  content  domains  are  item  generating 
rules that permit generalizations of practical significance to be made.
Jackson (1970) concludes, "For complex behavior domains, it appears that
at least until explicit models stated in measurable terms are developed, a
degree of subjectivity in test construction and attendant
population-referenced scaling will be required." The best approach appears to be the
use  of  a  detailed  test  specification  which  relates  test  item  development 
processes  to  behavior. 

Edgerton (1974) has suggested that the relationships among instructional
methods, course content and item format have not been adequately explored.
Item format should require thinking and/or performing in the patterns sought
by  the  instructional  methods.  If  the  instruction  is  aimed  at  problem  solving, 
then  the  items  should  address  problem  solving  tasks  and  not,  for  example, 
knowledge  about  the  required  background  content.  Edgerton  feels  that  if  one 
mixes  styles  of  items  in  the  same  test,  one  runs  the  risk  of  measuring 
"test  taking  skill"  instead  of  subject  matter  competence. 

In a practical application, Osborn suggests fourteen steps in the
course of developing a test for training evaluation. The first three steps
have to do with assembling information concerning the skills and knowledge
segments, the relative importance of each objective, and the completeness of
each objective. In step 4 the developer should obtain clarification
concerning the meaning of confusing elements. Osborn points out that
performance standards are generally a source of trouble. Steps 5 through 8
concern themselves with developing the test items and answering questions of
the feasibility of simulation as well as questions of controlled
administration. In step 9, a final aspect of measurement reliability is
considered. Here procedures for translating observed performance into a
pass-fail score must be developed. Unfortunately, Osborn does not tell us how
to develop pass-fail criteria that will generalize to trainees' performance in
the field. In step 10 a supplementary scoring procedure is developed for
diagnosing reasons for trainee failure. Osborn does not say if this is to be a
criterion- or norm-referenced interpretation. In step 11 the developer formats
the final item with its instruction, scoring procedures, etc. In step 12 a
decision is made as to whether time permits testing on all objectives or if a
sample should be used. Step 13 covers sampling procedures based on the
criticality of the behavior. In step 14 guidance for test administration is
prepared. Osborn has provided the developer of CRTs with a broad outline of the
steps to be taken in item development. Unfortunately, he does not provide much
detail on how various decisions are to be made, i.e., what are passing scores,
how to simulate, etc. It is the quality of these decisions that determines the
usefulness of the final instrument but the decision-making process apparently
remains an art.


MASTERY  LEARNING 

Besel (1973a,b) contends that norm-group performance is useful and
legitimate information for the construction and application of CRT. Besel
defines a CRT as a set of items sampled from a domain which has been judged
to be an adequate representation of an instructional objective. The domain
should be fully described so as to allow two test developers to independently
generate equivalent items which measure the same content and are equally
reliable. A degree of arbitrariness creeps in when a mastery level is specified
for a given objective or set of objectives. Besel recommends the "Mastery
Learning Test Model" to provide an appropriate algorithm to support mastery/
non-mastery decisions. Two statistics are computed: the probability that a
student has indeed achieved the objective and the proportion of a group which
has achieved the objective. The model assumes that each student can be
treated as either having achieved the objective or not having achieved the
objective, with no partial achievement possible. The Mastery Learning Test Model
and its underlying true score theory are related to a notion enunciated by
Emrick. Emrick assumed that measurement error was attributable to two
sources: α, the probability that a non-master will correctly answer an item
("false positive"), and β, the probability that a master will give an
incorrect answer to an item ("false negative"). These constructs resemble the
Type I and Type II errors encountered in discussions of statistical inference.
Emrick's model assumes that all item difficulties and inter-item correlations
are equal, a difficult assumption in view of the assumed variability of the
former as a result of instruction and the difficulties in computing the latter.
Besel (1973a,b) has developed algorithms for estimating α and β. Three
data sources are used:

1. Item difficulties

2. Inter-item co-variance

3. Score histograms

In a tryout, Besel reports "that the usage of an independent estimate
of the proportion of students reaching mastery resulted in improved stability
of Mastery Learning parameters." This improved stability of α and β should
promote increased confidence in mastery/non-mastery decisions. Besel's
computational procedures are, however, quite involved, using a multiple
regression approach which requires independent a priori estimates of variance
due to conditions. Besel also points out that β is estimated best for a
group when the mastery level is lowered while the reverse is true for α. In
other words, Besel has empirically established a relationship between errors
of misclassification and criterion level. A decision, however, has not been
made  concerning  the  relative  cost/effectiveness  of  the  competing  errors  of 
misclassification.  These  decisions  may  have  to  be  made  individually  for 
each  instructional  situation. 
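
The way the two error probabilities feed a mastery/non-mastery decision can be illustrated with Bayes' rule. The sketch below is not Besel's estimation procedure. It assumes that α (the probability a non-master answers an item correctly) and β (the probability a master answers an item incorrectly) are already known, that items are answered independently, and that the proportion of masters in the group is supplied as a prior; all names and numerical values are hypothetical.

```python
def mastery_posterior(score: int, n_items: int,
                      alpha: float, beta: float, prior: float) -> float:
    """Probability that a student is a master, given `score` correct answers
    out of `n_items`, under an all-or-none mastery assumption."""
    wrong = n_items - score
    # Likelihood of this response pattern for a master and for a non-master.
    like_master = (1 - beta) ** score * beta ** wrong
    like_nonmaster = alpha ** score * (1 - alpha) ** wrong
    numerator = prior * like_master
    return numerator / (numerator + (1 - prior) * like_nonmaster)


# Hypothetical values: 5-item test, alpha = 0.2, beta = 0.1, half the group masters.
posteriors = [mastery_posterior(x, 5, alpha=0.2, beta=0.1, prior=0.5)
              for x in range(6)]
for x, p in enumerate(posteriors):
    print(x, round(p, 4))
```

Moving the criterion score up or down the resulting table trades one kind of misclassification for the other, so the choice of cutoff depends on the relative cost assigned to each error in a given instructional situation.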


ESTABLISHING  AND  CLASSIFYING  INSTRUCTIONAL  OBJECTIVES 

The  development  of  student  performance  objectives  for  instructional 
programs has become a widespread and well-understood process throughout
the educational community. For quality control of the conventional process
crucial information derives directly from instructional objectives; they
provide not only the specifications for instruction, but also the basis for
evaluating instruction (Lyons, 1972). Ammerman and Melching (1966) trace the
interest in behaviorally stated objectives from three independent movements
within  education.  The  first  derives  from  the  work  of  Tyler  1 1  ■ 1'k'!, ' 
and  his  associates  who  worked  for  over  V-  years  at  specifying  the  goals  of 
education  in  terms  of  what  would  be  meaningful  and  useful  to  the  classroom 
teacher. Tyler's work has had considerable impact in the trend toward
describing  objectives  in  terms  of  instructional  outcomes. 

The second development has come from the need to specify man-machine
interaction in modern defense equipment. Miller was responsible for
pioneering efforts in developing methods for describing and analyzing job
tasks. Chenzoff (1964) reviewed the then extant methods in detail and many
more have appeared since that date. More recently Davies (1973) classified
task analysis schemes into six categories:

1. Task analysis based upon objectives, which involves analysis of a
task in terms of the behaviors required, i.e., knowledge, comprehension, etc.

2. Task analysis based upon behavioral analysis, i.e., chains, concepts,
etc.

3. Task analysis based on information processing needs for performance,
i.e., indicators, uses, etc.

4. Task analysis based on a decision paradigm which emphasizes the
judgement and decision-making rationale of the task.

5. Task analysis based upon subject matter structure of a task.

6. Task analysis based upon vocational schematics which involve analysis
of jobs, duties, tasks and task elements.

The  point  of  Davies'  breakdown  is  that  there  is  no  one  task  analysis 
procedure. The general approach is to "gin up" a new task analysis scheme or
modify an existing scheme to suit the needs of the job at hand.


The  third  development  was  the  concept  of  programmed  instruction  which 
required  the  writers  of  programs  to  acquire  specific  information  in 
instructional  objectives. 

It  is  apparent  that  these  initial  phases  of  development  have  largely 
merged,  and  the  use  of  instructional  objectives  has  become  accepted 
educational  practice.  A  critical  event  in  this  fusion  was  the  publication  of 
Mager's little book Preparing Instructional Objectives. In this
work, Mager set forth the requirements for the form of a useful objective but
he did not deal with the procedures by which one could obtain the information
to support preparation of the objectives. A series of additional works
including one on measuring instructional intent (Mager, 1973) have dealt more
thoroughly  with  such  issues. 

Information  as  to  the  actual  behaviors  exhibited  by  an  acceptable 
performer  is  preferred  as  the  basis  for  the  construction  of  an  instructional 
objective.  However,  data  can  come  from  a  variety  of  sources,  such  as: 

1. Supervisor interview

2. Job incumbent interview

3. Observation of performer

4. Inferences based on system operation

5. Analysis of "real world" use of instruction

6. Instructor interview

The  methods  used  to  derive  this  data  are  legion  and  have  become  very 
clever and sophisticated. Flanagan's (1954) "critical incident technique"
and the various modifications and off-shoots it has inspired are a good
example of an effort aimed at identifying essential performance while
eliminating  information  not  directly  related  to  the  successful  accomplish¬ 
ment  of  a  job-related  task. 

The  choice  of  method  for  deriving  job  behavior  instruction  must  be  based 
on  the  type  of  performance  and  various  realistic  factors  such  as  the 
accessibility of the performance to direct observation. Generally the
solution is less than ideal, but techniques such as Ammerman and Melching's
(1966) can be used to review the objectives so derived and provide a
useful critique of the data collection method. An exhaustive review of the
various techniques for deriving instructional objectives is impossible here.
The reader is directed to Lindvall (1964) and Smith for a comprehensive
treatment of this question.

Ammerman and Melching (1966) have developed a system for the analysis
and  classification  of  terminal  performance  objectives.  Ammerman  and  Melching 
examined  a  great  number  of  objectives  generated  by  different  agencies  and 
concluded  that  five  factors  accounted  for  the  significant  ways  in  which  most 
existing  performance  objectives  differed.  These  factors  are: 

1. Type of performance unit

2. Extent of action description

3. Relevancy of student action

4. Completeness of structural components

5. Precision of each structural component

Further,  Ammerman  and  Melching  have  identified  a  number  of  levels  under 
each factor. For instance, factor 1 has three levels from specific task
which  involves  one  well-defined  particular  activity  in  a  specific  work 
situation  to  generalized  behavior  which  refers  to  a  general  measure  of 
performance  or  way  of  behaving,  such  as  the  work  ethic. 

With  these  five  factors  and  the  identification  of  levels  for  each 
factor,  it  is  possible  to  classify  or  code  any  terminal  objective  by  a  five 
digit  number.  This  scheme  has  high  value  for  management  control  and  review 
of terminal performance objectives. Ammerman and Melching feel the method
can  fulfill  three  main  purposes: 

1.  Provision  of  guidance  for  the  derivation  of  objectives  and 
standardization  of  statements  of  objectives  so  that  all  may  meet  the 
criteria of explicitness, relevance, and clarity.

2. Evaluating the proportion of objectives dealing with specific or
generalized action situations.

3. Evaluating the worth of a particular method for deriving objectives.

This is an extremely useful method, particularly where a panel of judges
is used to review each objective. A coefficient of congruence can be
computed between the judges' placements of the objective on the five
dimensions to yield a relative index of agreement. Used in this fashion, the
Ammerman and Melching method should prove to be very useful in the
development of instructional systems.
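
The coding and agreement steps described above can be sketched in a few
lines. This is only an illustration under stated assumptions: the factor
names are paraphrases, the level values are invented, and the text does not
say which coefficient of congruence is intended, so Tucker's congruence
coefficient is assumed here.

```python
from itertools import combinations
from math import sqrt

# Hypothetical short names for the five factors listed in the text.
FACTORS = ("performance_unit", "action_description", "relevancy",
           "completeness", "precision")


def code_objective(levels):
    """Return the five-digit code for one judge's placement of an objective."""
    if len(levels) != len(FACTORS) or not all(1 <= v <= 9 for v in levels):
        raise ValueError("expected five levels, each a single digit 1-9")
    return "".join(str(v) for v in levels)


def congruence(x, y):
    """Tucker's congruence coefficient between two placement vectors
    (an assumed choice; the source does not name a specific coefficient)."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def mean_pairwise_congruence(placements):
    """Average the coefficient over every pair of judges."""
    pairs = list(combinations(placements, 2))
    return sum(congruence(x, y) for x, y in pairs) / len(pairs)


# Invented placements of one objective by three judges.
judges = [(1, 2, 3, 2, 1), (1, 2, 3, 2, 1), (2, 2, 3, 1, 1)]
codes = [code_objective(j) for j in judges]
agreement = mean_pairwise_congruence(judges)
```

Because all levels are positive, this coefficient tends to run high; it is
best read as a relative index for comparing objectives or panels, which is
how the text describes its use.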

DEVELOPING  TEST  MATERIALS  AND  ITEM  SAMPLING 

Hively and his associates (1968, 1973) provide a useful scheme for
writing items which are congruent with a criterion. Hively's effort has been
in the area of domain-referenced achievement testing. In Hively's system,
an item form constitutes a complete set of rules for generating a domain of
test items which are accurate measures of an objective. Popham points
out that this approach has met with success where the content area has
well-defined limits. In areas such as mathematics, independent judges tend to
agree on whether a given item is congruent with the highly specific behavior
domain referenced by the item form. As less well-defined fields are
approached, however, it becomes very difficult to prepare item forms so
that they yield test items which can be subsequently judged congruent with
a given instructional objective. Easy interjudge agreement tends to fade
and the items become progressively more cumbersome. Popham (1970) remarks:
"Perhaps the best approach to developing adequate criterion-referenced test
items will be to sharpen our skill in developing item forms which are
parsimonious but also permit the production of high congruency test items."
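
The idea of an item form as a set of generation rules can be illustrated
with a small sketch. The form below (two-digit addition with a carry) is a
hypothetical example chosen for a well-defined content area; it is not one of
Hively's published item forms.

```python
import random


def addition_item_form(rng):
    """Hypothetical item form: add two two-digit numbers whose ones digits
    produce a carry and whose sum is still a two-digit number."""
    while True:
        a = rng.randint(10, 89)
        b = rng.randint(10, 89)
        if (a % 10) + (b % 10) >= 10 and a + b <= 99:
            return {"stem": f"{a} + {b} = ?", "key": a + b}


def sample_test(n_items, seed):
    """Draw one test form by sampling the domain defined by the item form."""
    rng = random.Random(seed)
    return [addition_item_form(rng) for _ in range(n_items)]


# Two forms sampled from the same domain; different seeds give different
# items that follow the same generation rules.
form_a = sample_test(5, seed=1)
form_b = sample_test(5, seed=2)
```

Every generated item satisfies the same rules, which is what lets independent
judges agree that an item belongs to the domain. In less well-defined fields
the rules themselves are the hard part to write.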

Cronbach presents a generalizability-theoretic approach
to achievement testing. Cronbach's theory presents a mathematical model in
the framework of which an achievement test is assumed to be a sample from a
large, well-defined domain of items. Parallel test forms are obtained by
repeated sampling according to a plan. Analysis of variance techniques
(particularly intra-class correlation) are used to obtain estimates of
components of variance due to sampling error, testing conditions, and other
sources which may affect the reliability of the score. It should be pointed
out that analysis of variance, when used in this fashion, is essentially a
non-parametric technique particularly suitable for use with CRTs.
Generalizability theory has been extended (Osburn, 1968) by including the
concepts of task analysis, which allow subject matter to be sorted into
well-defined behavioral classes. Osburn (1968) has termed this convergence
"universe-defined achievement testing." Hively et al. (1968, 1973) have used
these techniques in an exploration of the mathematics curriculum. Mathematics
represents a subject domain particularly suited to this approach, and Hively
reported success, as evidenced by the high intra-class correlations
between sets of items sampled from a universe of items. If applicable to less
well-defined content domains, this technique promises to have diagnostic
utility and also particular relevance to examining the form of relationships
between knowledges and skills. As yet, this extension into other subjects
has not been undertaken.
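
As an illustration of the intra-class correlation mentioned above, the
following sketch computes a one-way random-effects coefficient from scores on
parallel forms. The one-way model and the sample scores are assumptions made
for the example; the source does not specify the exact design Cronbach or
Hively used.

```python
def intraclass_correlation(scores):
    """One-way random-effects intra-class correlation.

    `scores` is a list of rows, one per examinee, each row holding that
    examinee's scores on k randomly sampled parallel forms.
    """
    n = len(scores)       # number of examinees
    k = len(scores[0])    # number of forms per examinee
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    # Between-examinee and within-examinee sums of squares.
    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(scores, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Invented scores: four examinees, two forms sampled from one item universe.
sample_scores = [[9, 10], [5, 6], [7, 7], [3, 4]]
icc = intraclass_correlation(sample_scores)
```

A value near 1 indicates that forms sampled from the same universe rank
examinees almost identically, which is the kind of evidence Hively is said to
have reported for mathematics.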


QUALITY  ASSURANCE 

Hanson and Berger (1971) view quality assurance as
a means for maintaining desired performance levels during the operational
use of a large-scale instructional program. These workers identify six
major components in a quality assurance program:

1. Specification of indicator variables. These are variables which
measure the important attributes or aspects of a program and must be
individually defined for each instructional system.

Examples given are:

a. Pacing--measure of instructional time

b. Performance--interim measures of learning, i.e., unit tests,
module tests, etc.

c. Logistics--indicator reports of failure to deliver materials,
etc.


2.  Definition  of  decision  rules.  The  emphasis  here  should  be  on 
indicators  which  signal  a  major  program  failure.  Critical  levels  may  be 
determined  on  the  basis  of  evidence  from  developmental  work  or  on  the  basis 
of  an  analysis  of  program  needs. 

3. Sampling procedures. These questions must be answered on the basis
of an analysis of the severity of effects, if sufficient information is
available. Factors to be considered include:

a. Number of program participants to provide data

b. How to allocate sampling units

c. Amount of information from each participant

4. Collecting quality assurance data. Special problems here concern
the  willingness  of  participants  to  cooperate  in  the  data  gathering  effort. 
Data  must  be  timely  and  complete.  Hanson  and  Berger  suggest  a  number  of 
ways  to  reduce  data  collection  problems: 

a.  Minimize  the  burden  on  each  participant  by  collecting  only 
required  data. 

b.  Use  thoroughly  designed  forms  and  simplified  collection 
procedures. 


c.  Include  indicators  which  can  be  gathered  routinely  without 
special  effort. 


5. Analysis and summarization of data. Some data may be analyzed as they
come in; other data may have to be compiled for later analysis. The exact
technique will depend on the type of decision the data must support.


6. Specification of actions to be taken. This step must describe the
actions to be taken in the event of major program failure. Alternatives
should be generated and scaled to the severity of the failures. Information
as to actions taken to correct program failures should always be fed back
into the program development cycle. This feedback will be an important source
of information to guide program revision.
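
The six components can be pictured as a small monitoring routine: indicators
with critical levels, a decision rule, and actions scaled to severity. The
sketch below is illustrative only; the indicator names, thresholds, severity
bands, and action labels are all hypothetical and are not taken from Hanson
and Berger.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    name: str
    critical_level: float   # hypothetical threshold from developmental data
    higher_is_worse: bool   # direction in which the indicator signals failure


def severity(ind, observed):
    """Relative amount by which an observation passes the critical level
    in the failing direction (0.0 when the level is not passed)."""
    if ind.higher_is_worse:
        gap = observed - ind.critical_level
    else:
        gap = ind.critical_level - observed
    return max(0.0, gap / ind.critical_level)


def action_for(sev):
    """Scale the response to the severity of the failure (assumed bands)."""
    if sev == 0:
        return "no action"
    if sev < 0.10:
        return "notify instructor"
    if sev < 0.25:
        return "schedule remedial review"
    return "halt and revise program"


INDICATORS = [
    Indicator("pacing_hours_per_unit", 10.0, True),
    Indicator("unit_test_pass_rate", 0.80, False),
    Indicator("late_material_deliveries", 2.0, True),
]


def review(observations):
    """Apply the decision rule to one reporting period's observations."""
    return {ind.name: action_for(severity(ind, observations[ind.name]))
            for ind in INDICATORS}


report = review({"pacing_hours_per_unit": 10.5,
                 "unit_test_pass_rate": 0.55,
                 "late_material_deliveries": 1.0})
```

In practice the output of such a review would also be logged and fed back
into the development cycle, as component 6 requires.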

Hanson  and  Berger  offer  an  illustrative  example  of  how  this  process  might 
be  implemented.  They  conclude  by  noting  that  quality  assurance,  as  applied 
to  criterion-referenced  programs,  would  act  to  ensure  that  the  specified 
performance  levels  will  be  maintained  through  the  life  of  a  program.  These 
notions  provide  the  basis  of  an  important  concept  in  the  implementation  of 
an  instructional  program  utilizing  criterion-referenced  measurement.  If  this 
sort  of  internal  quality  assurance  program  is  built  into  the  instruction,  then 
the probability of an instructional program becoming "derailed" while up and
functioning  is  certainly  minimized. 




DESIGNING  FOR  EVALUATION  AND  DIAGNOSIS 


Baker feels that the critical factor in instruction is not how
the test results are portrayed (NRT or CRT) but how they are obtained and
what they represent. Baker suggests the term construct-referenced to
describe achievement tests consisting of a wide variety of item types and
a well-sampled content range. These tests are of the norm-referenced
type. Criterion-referenced tests, Baker feels, are probably better termed
domain-referenced tests (see the discussion of Hively et al.). A
domain specifies both the performance the learner is to demonstrate and
the content domain to which the performance is to generalize. Another
subset of CRT is what Baker refers to as the objective-referenced test. The
objective-referenced test starts with an objective based on observable
behavior from which it is possible to produce items which are homogeneous
yet relate to the objective. Baker feels the notion of domain-referenced
tests is more useful.

Each type of test will provide different information to guide improvement
of instructional systems. Construct-referenced tests will provide
information regarding a full range of content and behavior relevant to a
particular construct. The objective-referenced test will provide items
which exhibit similar response requirements relating to a vaguely defined
content area. The domain-referenced test will include items which conform
to a particular response segment, as well as to a class of content to which
the performance is presumed to generalize.

Baker then proposes a minimum set of data needed to implement an
instructional improvement cycle:

1. Data on applicable student abilities

2. Ability to identify deficiencies in student achievement

3. Ability to identify possible explanations for deficiencies

4. Ability to identify alternative remedial sequences

5. Ability to implement sequence

All three types of tests provide data useful for set 1.
Construct-referenced tests are probably the most readily available, but are
not administered on a cycle compatible with diagnosis and are reported in a
nomothetic manner. A well-designed objective-referenced test may be
scheduled in a more useful fashion. A domain-referenced test provides enabling
information to allow instructors to identify what the students were able
to deal with. Identification of performance deficiencies (set 2) is
theoretically possible with all three sets of data. However, since cut-offs
are usually arbitrary, none of the three tests will give adequate information.


As for sets 3, 4, and 5, there is little in the way of information
yielded by any of the three tests which would aid in those decisions.
Moreover, training research is not yet well-advanced in those areas, nor
does the information always reach the user level. In addition, incentives
are lacking, since most accountability programs are used to punish
deficiency rather than to promote efficiency. Of the three test types, the
domain-referenced tests give program developers the most assistance, for
they are provided with clear information about what kind of practice items
are in the area of content and performance measured by the test. Also,
students may practice on a particular content domain without contacting
the test items themselves. However, Baker points out that domain-referenced
items are hard to prepare, mainly because not all content areas are analyzed
in a fashion to allow specification of the behaviors in the domain, as has
been noted elsewhere.


ESTABLISHING  PASSING  SCORES 

Prager, Mann, Burger, and Cross discuss the cut-off point issue
and point out that there are two general routes to travel. The first method
involves setting an arbitrary overall mastery level. The trainee either
attains at least criterion or not. A second procedure is that of requiring
all trainees to attain the same mastery level in a given objective but to
vary the levels from objective to objective, depending on the difficulty
of the material, importance of the method for later successful performance,
etc. This second method seems more reflective of reality but, as Prager et al.
point out, it is certainly more difficult to implement, let alone
justify, the specific levels that have been decided upon. Prager et al. believe
that for handicapped children, at least, it would be appropriate to set
mastery levels for each child relative to his potential. Nitko concurs
and suggests different cut-offs for different individuals. However, the
feasibility of individual cut-offs seems doubtful. Lyons points out
that standards must take into account the varying criticality of the tasks.
The criticality of any task is basically an assessment of the effect on an
operating system of the incorrect performance of that task. Criticality
must be determined during the task analysis and must be incorporated into
the training objective. Unfortunately, in most cases the criticality of a
task is not an absolute judgment and the selection of a metric for
criticality becomes somewhat arbitrary.

The approach to reliability advocated by Livingston holds some
promise for determining pass-fail scores. If Livingston's assumptions are
accepted, then it becomes possible to obtain increased measurement
reliability by varying the criterion score. If the criterion score is set so
that a very high or very low proportion pass, then we will obtain reliable
measurement. Unfortunately, it is not often possible to "play around" with
criterion scores to this extent. The training system may require a certain
number passing and the criterion score is usually adjusted to provide the
required number.
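
The effect of moving the criterion score can be shown numerically. The
sketch below uses the usual statement of Livingston's criterion-referenced
reliability coefficient, in which squared deviations are taken from the
criterion score rather than from the mean; the test statistics plugged in
are invented for illustration.

```python
def livingston_k2(reliability, mean, variance, cutoff):
    """Livingston's criterion-referenced reliability coefficient.

    k2 = (r * var + (mean - C)^2) / (var + (mean - C)^2)

    where r is the conventional (norm-referenced) reliability, var is the
    observed-score variance, and C is the criterion (cut-off) score.
    """
    shift = (mean - cutoff) ** 2
    return (reliability * variance + shift) / (variance + shift)


# Invented test statistics: reliability .60, mean 70, variance 100.
at_mean = livingston_k2(0.60, 70.0, 100.0, cutoff=70.0)
far_from_mean = livingston_k2(0.60, 70.0, 100.0, cutoff=90.0)
```

With the cut-off at the mean the coefficient equals the conventional
reliability; moving the cut-off away from the mean (so that very many or
very few pass) raises it, which is the behavior the paragraph describes.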


From  this  discussion  it  is  apparent  that  there  are  no  completely 
generalizable  rules  to  guide  the  setting  of  cut-off  scores.  The  cut-off 
must  be  realistic  to  allow  the  training  system  to  provide  a  sufficient 
amount of trained manpower at some realistic level of competence.

Training  developers  setting  the  cut-off  score  must  therefore  consider  the 
abilities  of  the  trainee  population,  the  through-put  requirements  of  the 
training  system,  the  minimum  competence  requirement,  and  act  accordingly. 
The use of summative try-out information should allow a realistic solution
to  the  cut-off  question  for  specific  applications. 
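
One way to picture the trade-off among trainee ability, through-put, and
minimum competence is a simple search over try-out scores. This is a sketch
of one possible procedure, not a rule taken from the literature reviewed; the
function name, the scores, and the requirements are hypothetical.

```python
def choose_cutoff(tryout_scores, required_passes, minimum_competence):
    """Return the highest observed score that can serve as a cut-off while
    staying at or above the minimum competence level and still passing the
    required number of trainees. Return None when no such cut-off exists."""
    for cutoff in sorted(set(tryout_scores), reverse=True):
        if cutoff < minimum_competence:
            break
        passes = sum(score >= cutoff for score in tryout_scores)
        if passes >= required_passes:
            return cutoff
    return None


# Invented summative try-out scores for ten trainees.
scores = [55, 62, 70, 74, 78, 81, 85, 90, 93, 97]
feasible = choose_cutoff(scores, required_passes=6, minimum_competence=70)
infeasible = choose_cutoff(scores, required_passes=9, minimum_competence=70)
```

A result of None signals that the through-put requirement and the minimum
competence requirement cannot both be met with the present trainee
population, so one of them, or the training itself, must change.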


USES OF CRT IN NON-MILITARY EDUCATION SYSTEMS

Prager et al. describe research on one of the first CRT systems
(Individual Achievement Monitoring System - IAMS) designed for the
handicapped and designed for widespread implementation. Prager et al. point
out that standardized tests often are useless when applied to handicapped
individuals. They are simply too global in nature to be of much use in
directing remediation. Tests built to reflect specific instructional
objectives are much more useful when dealing with such populations. The
use of CRTs also allows relating a handicapped child's progress to criterion
tasks and competency levels. The use of CRTs is further indicated by the
need for individualized instruction and individualized testing when dealing
with individuals who exhibit a variety of perceptual and motor deficiencies.
As a result of these considerations, a CRT-centered accountability system
has been devised. This project began with the construction of a bank of
objectives and test items to mesh with the type of diagnostic
individualization peculiar to the education of the mentally handicapped. To
meet these needs, the objectives were, of necessity, highly specified. The
CRT-guided instructional system was geared to yield information to support
three types of decision: placement, immediate achievement, and retention.
Standardized diagnostic and achievement tests were also used to aid in
placement decisions. The system is still in the early stages of
implementation, so no comment can be made concerning its ultimate usefulness.

More recently, Popham presents considerable data concerning the
use of teacher performance tests. These tests require a teacher to develop
a "mini-lesson" from an explicit instructional objective. After planning
the lesson, the teacher instructs a small group of learners for a short
period of time. At the conclusion of the "mini-lesson", the learners are
given a post-test. Affective information is derived by asking the learners
to rate the interest value of the lesson. Popham reviews three potential
applications of the teacher performance test:

1. A focusing mechanism. To provide a mechanism to focus the teachers'
attention on the effects of instruction, not on "gee-whiz" methods.

2. A setting for testing the value of instructional tactics. The
teacher performance test can be used as a "test bed" to evaluate the
differential effectiveness of various instructional techniques. The teacher
need not be the instructor, but the important aspect of this application
involves a post-lesson analysis in which the instructional approach is
appraised in terms of its effects on learners.

3. A formative or summative evaluation device. Popham views this
application of teacher performance tests to program evaluation to be
extremely important, particularly in the appraisal of in-service and
pre-service teacher education programs.

Popham presents three in-service and pre-service applications of the
teacher performance tests. These applications were for the most part
viewed as effective. However, a number of problems were revealed in the
course of these applications that may be symptomatic of performance tests in
general. Popham found that unless skilled supervisors were used in the
conduct of the mini-lesson, most of the advantages of the post-lesson analysis
were lost. Popham also found that visible dividends were gained by the use of
supplemental normative information to give the teacher and the evaluator a
bit more information regarding the adequacy of performance. In a similar area
of endeavor, Baker reports the use of a teacher performance test as
a dependent measure in the evaluation of instructional techniques. Baker
discussed some shortcomings of the use of CRTs as dependent variables. These
shortcomings are largely based on the peculiar psychometric properties of
CRTs. However, Baker feels that CRTs are valuable for research purposes even
with the large number of unanswered questions concerning their reliability
and validity. Baker points out "...if the tests have imperfect reliability
coefficients in light of imperfect methodology, the researcher is compelled
to report the data, qualify one's conclusions, and encourage replication."
Baker also feels the use of teacher performance tests with their indeterminate
psychometric characteristics is not ethically permissible for evaluation of
individuals--at least for the present.

In a slightly different area of application, Knipe summarizes the
experience of the Grand Forks Learning System, in which CRTs played a very
salient part. The Grand Forks School District began by specifying in detail
the performance objectives for K-12 in most subject areas. These objectives
were to form the basis of a comprehensive set of teacher/learner contracts
as one instructional method by which students could meet the objectives. It
was found that mathematics was the subject area most amenable to analysis and
therefore received the most extensive treatment. The mathematics test
consisted of approximately l.'O criterion-keyed items for each grade level -■».
After extensive tryout the items were revised on the basis of teacher and
student recommendations as well as on the basis of a psychometric analysis.
The inclusion of psychometric analysis as a device to direct the revision
of items seems questionable in view of the limited variance of CRTs. In
summary, however, the teachers regarded the CRTs as useful in supplementing
NRTs, and in addition found them useful for placement. Finally, Knipe
concludes, "The criterion-reference test is the only type of test that a
school district can use to determine if it is working toward its curriculum
goals."


MILITARY  USES 


Extensive experience with use of CRT was reported by Taylor, Michaels,
and Brennan in connection with the Experimental Volunteer Army
Training Program (EVATP). To standardize EVATP instruction, reviews, and
testing, performance tests covering a wide variety of content were
developed and distributed to instructors. The tests were revised as
experience accumulated; some tests were revised as many as three times.
Drill sergeants used the tests for review or remediation, while testing
personnel used them in the administration of the general subjects,
comprehensive performance, and MOS tests. The tests also provided the basis
for the EVATP Quality Control System, which was intended to check on skill
acquisition and maintenance during the training process. Unfortunately,
problems were encountered with the change in role required of the instructors
and drill sergeants under the system of skill performance instruction and
training. Considerable effort was required to bring about the desired changes
in instructor role. The CRT-based quality control system performed its
function well by giving an early indication of problems in the new
instructional system. Evaluation of the performance-based system revealed
clear-cut superiority over the conventional instructional system. The
problems with institutional change encountered by these workers should be
noted by anyone proposing drastic innovation where a traditional
instructional system is well-established.

Pieper, Catrow, Swezey, and Smith present a description of a performance
test devised to evaluate the effectiveness of an experimental training
course. The course was individualized, featuring an automated apprenticeship
instructional approach. Test item development for the course performance test
was based on an extensive task analysis. The task analysis included many
photographs of job incumbents performing various tasks. These photos served
as stimulus materials for the tests and were accompanied by questions
requiring "What would I do" responses or identification of correct vs.
incorrect task performance. All items were developed for audio-visual
presentation, permitting a high degree of control over testing conditions.
Items were selected which discriminated among several criteria. Internal
consistency reliability was also obtained. This effort is illustrative of
good practice in CRT development and shows cleverness in the use of visual
stimuli; the statistical treatments used in selecting items are, however,
questionable. A somewhat similar development project, entitled Learner
Centered Instruction (LCI) (Pieper and Swezey), also describes a CRT
development process. Here, a major effort was devoted to using alternate form
CRTs, not only for training evaluation, but also for a field follow-up
performance evaluation after trainees had been working in field assignments
for six months.

Air Force Pamphlet 50-58, the Handbook for Designers of Instructional
Systems, is a seven-volume document which includes a volume dealing with
CRTs. A job performance orientation to CRT is advocated. Specific guidelines
for task analysis and for translating criterion objectives into test
items are presented in "hands-on performance" and in written contexts. The
document is an excellent guide to the basic "do's" and "don'ts" in CRT
construction. A similar Army document, TRADOC Regulation 350-100-1, Systems
Engineering of Training, presents guidelines for developing evaluation
materials and for quality control of training. The term CRT is used
interchangeably with "performance tests" and with "achievement tests" in this
document. The areas of CRT in particular and of evaluation in general are
given minimal coverage. CON Pam 350-11 is essentially a revision of TRADOC
Regulation 350-100-1, revised to be compatible with unit training
requirements. This document, although briefly mentioning testing and quality
control, presents virtually no discussion of CRT.

Various Army schools have developed manuals and guides for their own
use in the area of systems engineering of training. The Army Infantry
school at Fort Benning, Georgia, for example, has published a series of
Training Management Digests as well as a Training Handbook and Instructor's
Handbook. There also exist generalized guidelines for developing
performance-oriented test items in the form of memoranda to MOS test item
writers and via the contents of the TEC II program (Training Extension
Course). The Field Artillery school at Fort Sill, Oklahoma provides an
Instructional Systems Development Course pamphlet as well as booklets on
Preparation of Written Achievement Examinations and an Examination Policy and
Procedures Guide in the gunnery department. The Armor school at Fort Knox,
Kentucky, publishes an Operational Policies and Procedures guide to the
systems engineering of training courses. Generally these documents provide
only cursory coverage of CRT development, if it is covered at all.

The Army Wide Training Support group of the Air Defense school at Fort
Bliss, Texas provides an interesting concept in evaluation of correspondence
course development. Although correspondence course examinations are
necessarily paper and pencil (albeit criterion-referenced to the extent
possible), many such courses contain an OJT supplement which is evaluated
via a performance test administered by a competent monitor in the field
where the correspondent is working. This is a laudable attempt to move
toward performance testing in correspondence course evaluation. A supplement
to TRADOC Reg 350-100-1 on developing evaluation instruments has also
been prepared here. This guide provides examples of development of evaluation
instruments in radar checkout and maintenance and in leadership areas.

A course entitled "Objectives for Instructional Programs" (Insgroup),
which is used on a number of Army installations, has provided a
diagrammatic guide to the development of instructional programs. CRT is not
covered specifically in this document, nor is it addressed in the recent
Army "state-of-the-art" report on instructional technology (Branson, Stone,
Hannum, and Rayner). However, a CISTRAIN (Coordinated Instructional
Systems Training) course (Deterline and Lenn), which is also used
at Army installations for training instructional systems developers, does
deal with CRT development and, in fact, provides instructions for writing
items and for developing CRTs. The study guide (Deterline and Lenn)
deals with topics such as developing criteria, identifying objectives,
selecting objectives via task analysis, developing baseline CRT items,
revising first draft items, and preparing feedback. This document provides
a good discussion of CRT development in an overview fashion.


U.S. Army Field Manual 21-6 (20 January 1967) provides trainers and
instructors of U.S. Army in-service schools with guidance in the preparation
of traditional instruction, e.g., lectures, conferences, and demonstrations.
FM 21-6 (20 January 1967) contains a great deal of information on
construction of achievement tests, but the "why's" and "how's" are largely
lacking. The section on performance testing seems designed to discourage the
construction and use of performance tests. In addition, the manual is weak on
task analysis procedures -- procedures in general lack definition of method.
All testing concepts are directed at the construction of norm-referenced tests
of either job knowledge or performance. There is no discussion of how to
set cut-offs, or any discussion of the issues peculiar to CRT. The emphasis
is on relative achievement. Recently, FM 21-6 has undergone comprehensive
revision to suit the needs of field trainers. The revised manual (1 December
r,?,'7>)  is  generally  in  tune  with  contemporary  training  emphasis  with  consider¬ 
able  information  on  individualized  training  and  team  training.  In  particular, 
the  extensive  guidance  provided  on  objective  generation  should  prove  very 
useful to field trainers. While the revised FM 21-6 does not specifically
refer to CRT, the obvious emphasis on NRT which distinguished the earlier
version is gone. A possible weakness in the revised version is the tacit
assumption that all trainees will reach the specified standard of
performance. Although the requirement that all trainees reach criterion is
not by itself unreasonable, practical constraints of time and cost sometimes
dictate modified standards, e.g., a specified proportion of trainees reaching
criterion. Where it is not feasible to wash out or to recycle trainees,
remediation must be designed to permit an economical solution. FM 21-6 does
not seem to address the remediation problem. In general, though, FM 21-6 is
a good working guide to field training. It will be interesting to see how
effective it is in the hands of typical field training personnel.


From these limited examples it appears that the civilian sector has led
in the development and use of CRTs. Although the EVATP effort is a notable
exception, the use of CRTs in military operations has been slowed by the
high initial cost of developing criterion-referenced performance tests.
Often the use of CRTs for performance assessment has required operational
equipment or interactive simulators, drastically raising costs. School systems
have had success with CRTs, largely due to the nature of the content domains
chosen. These content domains heavily emphasize knowledge; hence tests can
be paper and pencil, which are cheap to administer. A solution to the cost
problem may be found in the notion of Osborn (1970), who has devised an
approach to "synthetic performance tests" which may lead to lowered testing
costs, although little concrete evidence has appeared in the literature to
date.


INDIRECT  APPROACH  TO  CRITERION-REFERENCING 


Fremer feels that it is meaningful to relate performance on
survey achievement tests to significant real-life criteria, such as minimal
competency in a basic skills area. The author discusses various ways of
relating survey test scores and criterion performance. All of these
approaches are aimed at criterion-referenced interpretation of test scores.
Fremer proposes that direct criterion-referenced inferences about an
examinee's abilities need not be restricted to tests that are composed of
actual samples of the behavior of interest. Fremer feels that considerable
use can be made of the relationships observed among apparently diverse
tasks within global content areas. Fremer further argues that tasks which
are not samples of an objective may provide an adequate basis for
generalization to that objective. Fremer notes that, given a nearly infinite
population of objectives, the use of a survey instrument as a basis for making
criterion-referenced inferences would allow increased efficiency.

An example is offered of the use of a survey reading test to make
inferences about ability to read a newspaper editorial. A CRT of ability to
read editorials might consist of items quite different from the behavior of
interest. Fremer offers an illustrative example of using vocabulary test
scores to define objective-referenced statements of ability to read
editorials. Fremer notes, however, that the usefulness of interpretive
tables, i.e., those that provide statements referencing criterion behaviors
to a range of test scores, depends heavily on the method used to establish
the relationship between the survey test scores and the objective-referenced
ability. An essential aspect would be the use of a large and broad enough
sample of criterion performance to permit generalization to the broader
range of performances. Fremer's example provides for the definition of
several levels of mastery and points out that an absolute dichotomy, mastery
versus non-mastery, will seldom be meaningful. It is difficult to understand
why Fremer makes this statement, as the basic use of CRT is to decide
whether an individual possesses sufficient ability to be released into the
field or requires further instruction. Many levels of performance can be
identified, but they are ultimately reduced to pass-fail, mastery/non-mastery.
Fremer apparently bases his objection on measurement error, which can render
classification uncertain. However, as discussed earlier, proper choice of
cut-off and careful attention to development should minimize classification
errors. Fremer proposes that the notion of minimal competency should
encompass a variety of behaviors of varying importance; the metric of
importance will vary with the goals of the educational system.

Fremer proposes a method for relating survey test performance
to a minimal competency standard that would involve a review of the proportion
of students at some point in the curriculum who are rated as failures.
This should serve as a rough estimate of the proportion of students failing
to achieve minimal competency. It would then be possible to apply this
proportion to the score distribution for the appropriate test in a survey
achievement test, clearly a normative approach. A second approach to
referencing survey achievement tests to a criterion of minimal competency
would be to acquire instructor judgment as to the extent to which individual
items could be answered by students performing at a minimal level. By
summing across items, it would be possible to obtain an estimate of the
expected minimum score. Fremer, however, recognizes the limitations of
this latter process with its high reliance on informed judgment. A further
method proposed by Fremer seeks to define minimal competency in terms of
student behaviors. The outcome of this method would be the identification
of bands of test scores that would be associated with minimal competency.
The processes involved in this method also rely on informed judgment, though.
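
The first two approaches described above reduce to simple arithmetic, and
a minimal Python sketch may make them concrete. The function names, the
sample scores, the 20 percent failure rate, and the judged probabilities are
hypothetical illustrations, not values or procedures taken from Fremer; the
rule that an examinee fails when scoring below the cut-off is likewise an
assumption.

```python
from typing import Sequence


def cutoff_from_failure_rate(scores: Sequence[float], failure_rate: float) -> float:
    """Normative approach: find the score below which about `failure_rate`
    of examinees fall (assumes a score below the cut-off means failure)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must lie between 0 and 1")
    ordered = sorted(scores)
    # Number of examinees expected to fail, capped so the index stays valid.
    n_fail = min(round(failure_rate * len(ordered)), len(ordered) - 1)
    return ordered[n_fail]


def expected_minimum_score(item_judgments: Sequence[float]) -> float:
    """Judgmental approach: sum, across items, the judged chance that a
    minimally competent student answers each item correctly."""
    if any(not 0.0 <= p <= 1.0 for p in item_judgments):
        raise ValueError("each judgment must lie between 0 and 1")
    return sum(item_judgments)


# Hypothetical survey-test scores and a hypothetical 20 percent failure rate.
survey_scores = [10, 12, 15, 18, 20, 22, 25, 27, 30, 33]
normative_cutoff = cutoff_from_failure_rate(survey_scores, 0.20)

# Hypothetical instructor judgments for a four-item test.
judged_minimum = expected_minimum_score([0.9, 0.8, 0.5, 0.3])
```

Both results depend entirely on the inputs supplied: the first on the
observed failure proportion, the second on informed judgment, which is the
limitation noted in the text.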




Another method proposed by Fremer to criterion-reference survey
achievement tests involves developing new tests with a very narrow focus,
i.e., a smaller area of content and a restricted range of difficulty. It
should not be necessary to address every possible objective. However, it
should be possible to develop a test composed of critical items by sampling
from the pool of items. The next step in the process would involve relating
achievement at various curriculum placements between the focused test and
the survey instrument. This should allow keying of the items on the survey
test with specific critical objectives.

Still another method put forth by Fremer to get from criterion-referenced
to survey tests is the stand-alone work sample test. This
technique is intended for use when there is an objective that is of such
interest that it should be measured directly. The procedures that Fremer
puts forth are very clever in concept and are mainly applicable to school
systems and traditional curricula where well-developed survey instruments
exist. Even so, considerable work is involved in keying the survey
instrument. In non-school system instructional environments, dealing with
non-traditional curricula, it is unlikely that an appropriate survey instrument
would exist.


USING  NRT  TO  DERIVE  CRT  DATA 

Cox  and  Sterrett  propose  an  interesting  method  for  using  NRTs 

to  provide  CRT  information.  The  first  step  in  this  procedure  is  to  specify 
curriculum  objectives  and  to  define  pupil  achievement  with  reference  to 
these  objectives.  The  second  step  would  involve  coding  each  standardized 
test  item  with  reference  to  curriculum  objectives.  With  coded  test  items 
and  knowledge  of  the  position  of  each  pupil  in  the  curriculum,  it  is  possi¬ 
ble  to  determine  the  item's  validity  in  the  sense  that  pupils  should  be  able 
to  correctly  answer  items  that  are  coded  to  objectives  that  have  already  been 
covered. Step three is the scoring of the test independently for each pupil,
taking into account his position in the curriculum. The authors suggest
that this model is particularly applicable to group instruction, since
placement in the curriculum can generally be regarded as uniform. Therefore, it
is possible to assign each pupil a score on items whose objectives he has
covered. It is also possible to obtain information on objectives which were
excluded or not yet covered. This method seems an economical way to extract
CRT  information  and  NRT  information  from  the  same  instrument.  The  technique 
has  yet  to  be  explored  in  practice,  however. 
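
The scoring step of this procedure can be sketched in Python as follows.
This is an illustration under stated assumptions, not Cox and Sterrett's
own implementation: the function name, the item identifiers, the objective
codes, and the pupil data are all hypothetical.

```python
from typing import Dict, Set


def score_by_curriculum_position(
    responses: Dict[str, bool],
    item_objectives: Dict[str, str],
    covered_objectives: Set[str],
) -> Dict[str, int]:
    """Split one pupil's item results by whether the item's objective has
    already been covered at the pupil's position in the curriculum."""
    summary = {
        "covered_correct": 0,
        "covered_total": 0,
        "uncovered_correct": 0,
        "uncovered_total": 0,
    }
    for item_id, is_correct in responses.items():
        covered = item_objectives[item_id] in covered_objectives
        group = "covered" if covered else "uncovered"
        summary[group + "_total"] += 1
        if is_correct:
            summary[group + "_correct"] += 1
    return summary


# Hypothetical coding of five standardized-test items to three objectives.
item_objectives = {
    "q1": "OBJ-A", "q2": "OBJ-A", "q3": "OBJ-B", "q4": "OBJ-C", "q5": "OBJ-C",
}
# Hypothetical pupil who has covered objectives A and B only.
pupil_responses = {"q1": True, "q2": False, "q3": True, "q4": True, "q5": False}
pupil_summary = score_by_curriculum_position(
    pupil_responses, item_objectives, {"OBJ-A", "OBJ-B"}
)
```

The "covered" counts give the criterion-referenced reading of the test for
this pupil, while the "uncovered" counts supply the information on
objectives not yet reached; the total across both groups remains available
for ordinary norm-referenced use.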

CONSIDERATIONS FOR A CRT IMPLEMENTATION MODEL

The development and use of CRT is a fairly recent development in instructional
technology. Partially as a result of this, there is no comprehensive
theory of CRT such as exists for NRT. Hence, the concepts of validity and
reliability for CRT are not yet well developed, although definition of these
concepts is necessary to reduce errors of classification. The need for
content validity in CRT is, however, well recognized. In addition, there is
no single CRT construction methodology which will serve for all content
domains. Unresolved questions also revolve around the issue of
bandwidth fidelity and the use of reduced fidelity in criterion-referenced
performance tests.

The  rationale  for  the  use  of  CRT  in  evaluating  training  programs  and 
describing  individual  performance  is  well  established.  To  ensure  best 
possible  results,  the  military  or  industrial  user  should  exert  every  effort 
to  maintain  stringent  quality  control,  including: 

1.  Careful  task  analysis: 

a.  Observation  of  actual  job  performance  when  possible 

b.  Identification  of  all  skills  and  knowledge  that  must  be  trained. 

c.  Careful  identification  of  job  conditions 

d.  Careful identification of job standards

e.  Identification  of  critical  tasks. 

2.  Careful  formulation  of  objectives 

a.  Particular  care  in  the  setting  of  standards 

b.  Identification  of  all  enabling  objectives 

c.  Independent  check  on  the  content  of  the  objectives 

d.  Special  attention  to  critical  tasks. 

3.  Item development

a.  Determine if all objectives must be tested

b.  Survey  of  resources  for  test 

c.  Determination  of  item  form 

d.  Statement  of  rules  for  items 

e.  Development  of  item  pool  for  objectives  to  be  tested 

f.  Develop  tryout  plan  and  criteria  for  item  acceptance 

g.  Tryout  of  items 

h.  Revision and rejection of items.




Particular care must be exercised in setting item acceptance criteria
for item tryout. The use of typical NRT item statistics should be minimized.
The usual methods are totally inadequate; for instance, internal consistency
estimates are only suitable with large numbers of items, and internal
consistency may not be an important consideration in any case. Traditional
stability indexes may also be inappropriate, due again to small numbers of
items and reduced variance. The technique proposed by Edmonston et al. may
prove effective in reducing errors of misclassification due to inadequate
test items.
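
The effect of reduced variance on an internal consistency estimate can be
shown with a short Python sketch. The Kuder-Richardson formula 20 (KR-20)
is used here only as a familiar example of such an estimate; the source does
not name a specific coefficient, and the function name and both data sets
are hypothetical.

```python
from typing import List, Optional


def kr20(responses: List[List[int]]) -> Optional[float]:
    """Kuder-Richardson 20 for 0/1 item scores (rows are examinees).
    Returns None when total scores show no variance at all."""
    if not responses or len(responses[0]) < 2:
        raise ValueError("need at least one examinee and two items")
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of the total scores.
    variance = sum((t - mean_total) ** 2 for t in totals) / n
    if variance == 0:
        return None
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / variance)


# Hypothetical mixed-ability group: scores spread from 0 to 4.
mixed_group = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
# Hypothetical post-training group: nearly everyone at or near mastery.
mastery_group = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 1]]

kr20_mixed = kr20(mixed_group)
kr20_mastery = kr20(mastery_group)
kr20_all_pass = kr20([[1, 1, 1, 1]] * 5)
```

With these made-up data the coefficient is respectable for the mixed group,
negative for the mastery group, and undefined when every examinee passes
every item, even though the items themselves are the same. This is the sense
in which such statistics say little about a short test given to trainees who
have mostly reached criterion.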

By  adhering  to  strict  quality  control  measures,  it  should  be  possible  to 
obtain  a  set  of  measures  that  have  a  strong  connection  with  a  specified  content 
domain. Whether they are sensitive to instruction, or whether they will
vary greatly due to measurement error, is unknown. Careful tryout and field
follow-up  may  currently  be  the  best  controls  over  errors  of  misclassification 
due  to  poor  measurement.  The  ethical  question  of  the  use  of  measures  with 
unknown  psychometric  properties  in  making  decisions  about  individuals  remains  to 
be  addressed. 


COST-BENEFITS  CONSIDERATION 

Although the costs of training and the costs of test administration can
readily be quantified in dollar terms, we lack a proper metric to completely
assess the costs of misclassification. Emrick (1971) proposes a ratio of regret
to quantify relative decision error costs. Emrick's metric, however, appears
rather arbitrary and in need of further elaboration. The probability of
misclassification is the criterion against which an evaluation technique must be
weighed. The results of misclassification range from system-related effects
to interpersonal problems. In some instances where misclassification results
in a system failure, cost can be accurately measured, and is likely to be high.
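
One way to see how a relative cost ratio interacts with misclassification
probabilities is the generic expected-loss calculation sketched below in
Python. It is not Emrick's formula. The function name, the reading of the
ratio as "cost of passing a non-master divided by cost of failing a master,"
and all of the numbers are assumptions made for illustration.

```python
def expected_misclassification_cost(
    p_false_master: float,
    p_false_nonmaster: float,
    regret_ratio: float,
    cost_false_nonmaster: float = 1.0,
) -> float:
    """Expected cost per trainee, in units of the cost of one false
    non-master decision (a master wrongly held back).

    regret_ratio is assumed to be the cost of a false master decision
    (a non-master wrongly passed) relative to a false non-master decision.
    """
    for p in (p_false_master, p_false_nonmaster):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    cost_false_master = regret_ratio * cost_false_nonmaster
    return (p_false_master * cost_false_master
            + p_false_nonmaster * cost_false_nonmaster)


# Two hypothetical cut-offs on the same test. The lenient one passes more
# non-masters; the strict one holds back more masters.
lenient_cost = expected_misclassification_cost(0.15, 0.05, regret_ratio=4.0)
strict_cost = expected_misclassification_cost(0.05, 0.15, regret_ratio=4.0)
```

When passing a non-master is assumed to be four times as costly as holding
back a master, the strict cut-off has the lower expected cost in this
example. The conclusion is only as good as the assumed ratio, which is the
difficulty the text raises.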

A relative index of cost can be gained from the task analysis. If the
analysis of the job reveals a large number of critical tasks or individual
tasks whose criticality is great, then the cost of supplying a non-master can
be assessed as high, and great effort is justified in developing a training
program featuring high-fidelity, costly CRT. Where the analysis does not
reveal high numbers of critical tasks, the cost then becomes a function of
less quantifiable aspects. Misclassification also results in job dissatisfaction
and morale problems evidenced by various symptoms of organizational
illness, e.g., absenteeism, high turnover, and poor work group cohesion.

A  possible  solution  to  the  cost-benefit  dilemma  may  come  from  work  with 
symbolic  performance  tests  and  the  work  cited  earlier  showing  that  job  knowl¬ 
edge  tests  can  sometimes  suffice.  The  use  of  symbolic  tests  and/or  job 
knowledge  tests  would  result  in  greatly  reduced  testing  costs  in  many  instances. 
The  decision  as  to  the  appropriateness  of  the  test  must  be  made  empirically 
on  the  basis  of  well  controlled  tryout  with  typical  course  entrants.  The 
development  of  symbolic  performance  tests  may  prove  to  be  difficult.  Much  is 
yet to be known about how to approach this development. If progress can be
made in lowering the cost of CRT, then the problem of cost-benefit analysis
will be largely obviated.


As the question currently stands, there is no doubt that CRT provides
a good basis for evaluation of training and the determination of what a
trainee can actually do. If the system in which the trainee must function
produces  a  number  of  critical  functions  which  will  render  misclassification 
expensive,  then  CRT  is  a  must. 


PART  2--SURVEY  OF  CRITERION-REFERENCED  TESTING  IN  THE  ARMY 


PURPOSE  AND  METHOD  OF  THE  SURVEY 

In order to survey the application of criterion-referenced testing
techniques  in  the  military,  a  number  of  Army  installations  were  visited. 
Information  was  collected  to  supplement  the  literature  search  and  review, 
to  provide  detailed  material  on  CRT  development  and  use  in  the  Army,  and 
to  obtain  information  on  attitudes  and  opinions  of  Army  testing  personnel. 

Specifically,  the  survey  gathered  data  on: 

1.  How  CRTs  are  developed  for  Army  applications.  In  order  to 
create  a  CRT  construction  manual  which  will  be  useful  to  Army  test  devel¬ 
opers,  it  is  necessary  to  determine  how  CRTs  are  currently  developed  in  the 
Army.  Additionally,  it  is  important  to  determine  differences  in  test  devel¬ 
opment  strategies  across  Army  installations,  so  that  the  manual  can  suggest 
procedures  which  will  mate  well  with  a  variety  of  approaches. 

2.  How  CRTs  are  administered  in  various  Army  contexts.  This 
information  is  important  since  design  for  administration  materially  affects 
the test construction process. Design information is important in creating
guidelines on development of CRTs, in order to make them suitable for
administration in diverse Army testing situations.

3.  How CRT results are used in the Army. The way in which a test's
results are used is a factor that must be considered in the development of
any test. Hence, the survey obtained data on use of test results in a
variety of Army testing situations.

4.  Extent of criterion-referenced testing in the Army. This includes
information on extensity (how prevalent criterion-referenced testing is in
the large, Army-wide sense) and information on intensity (how much testing
in specific Army contexts is of a criterion-referenced type).

5.  The level of personnel who will use the CRT Construction Manual
developed  by  the  project.  This  information  includes  educational  levels,  range 
of  military  experience,  and  familiarity  with  psychometric  concepts.  Such 
information  is  designed  to  help  tailor  the  manual  to  its  audience. 

6.  Problems encountered by Army testing personnel in the development
and use of criterion-referenced tests. Information on problems serves
two purposes. First, the identification of typical problem areas points
the way toward future research on criterion-referenced testing. Second, the
CRT  Construction  Manual  can  deal  with  typical  problems,  offering  suggestions 
for  avoiding  or  surmounting  them. 
7.  Attitudes of Army testing personnel toward the development and
use  of  CRTs.  It  is  important  to  assess  existing  attitudes  toward  CRTs 
among  Army  testing  personnel,  since  level  of  acceptance  is  an  indicator  of 
spread  and  utility  of  a  new  concept.  Additionally,  attitudinal  data  will 
enable  the  CRT  Construction  Manual  to  address  current  attitudes,  and  thus  to 
attempt  to  rectify  poor  attitudes  based  upon  misconceptions. 

8.  The probable future course of criterion-referenced testing in
the Army. Interview data, particularly those collected from personnel at
supervisory levels, indicate probable trends in future Army CRT use. Also,
problems in implementing CRT applications suggest needed research.

9.  Sample Army CRTs and problems in developing and using them. An
important  part  of  the  on-site  survey  is  to  gather  materials  to  serve  as  the 
basis  for  examples  of  CRT  development  and  use. 


Interview Protocol Development. In order to gather these types of
information, an interview protocol for on-site use at various Army posts was
developed. Development of the protocol included several review phases during
which revised versions of the protocol were prepared. The second version of
the protocol consisted of three forms: one to be used in interviews with test
constructors, another for test users, and a third to be used with supervisory
personnel. The final instrument combined these forms and included several
optional items for use in interviews with personnel who were especially
knowledgeable about criterion-referenced testing. The final version of the
protocol was found to have high utility, since it can be used to structure
interviews with personnel who serve any of three functions (test construction,
test use, and supervision). The protocol provides flexibility in the range
of topics to be discussed in an interview, thereby allowing interviews to be
tailored to the ranges of responsibilities, experience, and knowledge
possessed by individual interviewees. Appendix A of this report is a copy
of the final version of the protocol.

The  interview  protocol  was  used  in  a  series  of  one-to-one  interviews 
conducted during January, February, and March 197?. Installations surveyed
during  this  period  included  the  Infantry  School  at  Fort  Benning,  the  Artillery 
School  at  Fort  Sill,  the  Air  Defense  School  at  Fort  Bliss,  the  Armor  School 
at Fort Knox, and BCT and AIT units at Fort Ord. In addition, test-related
departments  were  surveyed  at  each  post.  A  total  of  10'  individuals  were 
interviewed. 

Survey Teams. A survey team spent three days at each post surveyed. The
interviews ranged in duration from approximately one-half to three hours
apiece and averaged about one and one-half hours. Interview length was at
the interviewer's discretion, based on the utility of the information obtained
from a subject.

Summaries of the types of personnel surveyed at each installation,
presented in a following section of this report, indicate each interviewee's
position in the organization for Army School, MOS, TEC, and Training Center
testing programs, and whether the individual is a test developer/user (test
administrator, test scorer, etc.) or a supervisor of test construction or use.
and TEC (Training Extension Course) Program personnel. No Training Center
data were collected at Fort Benning, while Fort Ord data dealt exclusively
with Training Center programs.

A  total  of  n'7  individuals  were  interviewed  in  School  organizations. 

This  focus  on  school  personnel  is  appropriate  since  the  CRT  Construction 
Manual  will  be  used  primarily  in  the  schools.  It  is  interesting  to  note 
that  of  the  "•>  subjects  who  were  asked  if  special  training  were  available 
for  testing  personnel,  almost  responded  yes.  This  does  not  mean  that 

of  the  subjects  asked  had  received  such  training,  but  that  training  in 
testing techniques is available in the Army. Many individuals who participated
in the survey were experienced in constructing or administering tests,
and several had received special training in testing. For a more detailed
analysis  of  the  subjects  and  their  organizational  positions,  see  Appendix  B. 

Tables  through  present  summaries  of  responses  to  quantifiable 
protocol items. The data upon which these summaries are based are in
Appendix C. Note that since interviews were tailored to address the knowledge
and experience of the individual, not all subjects were asked all items.
For example, if it was established that an individual was not involved in
test development but in test administration or in use of test results, that
individual  was  not  queried  concerning  test  construction.  Hence,  in  Table 
for  example,  a  maximum'  of  '  individuals  responded  to  a  given  item. 


Test Development. Table 2 summarizes responses to protocol items
concerning involvement with various steps of CRT development. Details of
Army test construction processes vary widely; however, some impressions of
the test construction process can be gained from Table 2.

The data presented in Table 2 are subject to interpretation. For
example, although slightly over half of the -0 subjects answered "yes" to the
protocol item about using an item analysis technique (item ’b), further
questioning during the interview usually revealed that they were not using a
formal item analysis technique. Instead, they typically inspect a computer
printout of percent right and wrong responses to items on a test. Items
having an unusually high number of wrong responses are reworked or discarded.
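
The informal screening described above amounts to computing the percent of
wrong responses per item and setting aside items that exceed some threshold.
A minimal Python sketch follows. The function name, the 60 percent
threshold, and the tryout data are hypothetical; the survey does not report
what cut-off, if any, developers applied.

```python
from typing import Dict, List


def flag_difficult_items(
    responses: List[List[bool]], max_percent_wrong: float = 60.0
) -> Dict[int, float]:
    """Return {item index: percent wrong} for every item whose percent of
    wrong responses exceeds the threshold (rows are trainees)."""
    if not responses:
        raise ValueError("responses must not be empty")
    n_trainees = len(responses)
    flagged: Dict[int, float] = {}
    for item in range(len(responses[0])):
        n_wrong = sum(1 for row in responses if not row[item])
        percent_wrong = 100.0 * n_wrong / n_trainees
        if percent_wrong > max_percent_wrong:
            flagged[item] = percent_wrong
    return flagged


# Hypothetical results: five trainees (rows) by three items (columns).
tryout = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [False, True, True],
    [True, True, False],
]
items_to_review = flag_difficult_items(tryout)
```

A screen of this kind identifies items that deserve a second look. It does
not by itself distinguish a badly written item from a well written item on
material that was poorly taught.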

After the final test items are selected, Army test developers usually do
not assess reliability and validity, at least in a strict psychometric sense.
Instead, the tests are administered several times and items that cause a
great deal of difficulty are reviewed to see if they are constructed
properly, a relatively informal process.
Table 1


SURVEY  OF  CRITERION-REFERENCED  TESTING  IN  THE  ARMY: 
SUBJECTS  INTERVIEWED  AT  FORT  BENNING,  FORT  BLISS, 
FORT  SILL,  FORT  KNOX,  AND  FORT  ORD 
(N  -  10‘  a) 


School 

MOS 

Training 

Center 

TEC 

Program 

S 

s 

1 

0 

1 

Ft.  Benning,  Georgia 

TDU 

l-'t 

1 

0 

1 

S 

o  ; 

0 

1 

Ft.  Bliss,  Texas 

TDU 

1.’ 

0 

O 

S 

7 

1 

0 

o 

Fort  Sill,  Oklahoma 

TDU 

> 

0 

1 

S 

7 

1 

1 

0 

Ft.  Knox,  Kentucky 

TDU 

i 

0 

0 

0 

S 

0 

0 

10 

0 

Ft.  Ord,  California 

TDU 

0 

0 

10 

0 

Totals 

t'Y 

(> 

.  'If 

'  \ 

Total Number of Supervisors (S) Interviewed: 1|1*


Total Number of Test Developers/Users (TDU) Interviewed: t>l
Table 2


INVOLVEMENT  IN  VARIOUS  STEPS  OF  TEST  DEVELOPMENT: 
SUMMARY  OF  RESPONSES  ACROSS  ALL  POSTS 


Item 

No. 

3 

Brief  Statement  of  Item 

Number  of 
Subjects 
Responding 
to  Item 

Percent 

of 

"Yes” 

Responses 

h 

Have  you  been  included  in  writing 
objectives 

7o 

,  *-? 

hb 

Do  you  write  objectives  in  opera¬ 
tional,  behavioral  terms? 

hr 

.  "1 

Have  you  participated  in  setting 
standards? 

6<> 

t  ; 

o 

Have  you  participated  in  imposing 
practical  constraints? 

nr 

- 

Have  you  helped  determine  priorities? 

ro 

<■" 

Have  you  been  included  in  writing 
test  items? 

i  o 

70 

b 

Do  you  write  item  pools? 

;-0 

06 

) 

Have  you  been  involved  in  selecting 
final  test  items? 

<>7 

*b 

Do  you  use  an  item  analysis  technique 

?  ‘i0 

'»:?  ■ 

11 

Do  you  measure  test  reliability? 

rh 

*)  5 

lib 

Do  you  compute  coefficients  of 
reliability? 

hr 

Uo 

1.-’ 

Do  you  aid  In  validating  tests? 

' '  • 

i 

>  ^ 

U'b 

Do  you  use  content  validity? 

hi 

For complete wording of the protocol items, see Appendix A.
It appears that relative care is taken in Army test development
programs  to  select  and  define  objectives  and  their  associated  conditions 
and  standards.  Some  care  is  taken  in  writing  items  to  match  these 
objectives.  From  this  point  on,  however,  empirical  rigor  is  lacking;  that 
is,  formal  item  analysis  and  assessment  of  test  reliability  and  validity 
are  infrequently  done. 

Test  Administration.  Table  3  presents  subject  responses  to  protocol 
items  dealing  with  test  administration.  A  large  proportion  of  subjects  in 
the survey have been involved in administering tests. This is not surprising
since  much  test  development  is  done  by  school  instructors;  thus,  individuals 
who  create  test  items  also  administer  the  tests  in  their  classes.  These 
are  heartening  data:  It  is  advantageous  for  test  developers  to  be  familiar 
with  test  administration  situations,  since  it  gives  them  increased 
familiarity  with  the  conditions  and  limitations  inherent  in  such  situations. 


Table  3 


INVOLVEMENT  IN  ASPECTS  OF  TEST  ADMINISTRATION: 
SUMMARY  OF  RESPONSES  ACROSS  ALL  POSTS 


Item
No. 

Brief  Statement  of  Item'1 

Number  of 
Subjects 
Responding 
to  Item 

Percent 

of 

"Yes" 

Responses 

10 

Have  you  participated  in  adminis¬ 
tering  tests? 

'  *•  *  »  * 

10b 

Do  you  ever  use  the  "assist  method"? 

v:- 

6\> 

13 

Do  you  use  "go-no  go"  scoring 
standards? 

100 

itO  ■ 

lit  b 

Do  you  retest  trainees  who  fail 
the  first  time? 

•y> 

7! 

For  complete  wording  of  the  protocol  items,  see  Appendix  A 


Table  3  also  shows  that  an  "assist"  method  of  scoring  is  frequently 
used.  It  appears  that  test  administrators  often  find  it  appropriate  to 
provide  help  to  individuals  taking  the  test.  The  actual  percentage  of 
test  administrators  using  a  true  assist  method  is  probably  somewhat  lower 
than  that  shown  in  Table  3,  since  a  good  number  of  those  who  stated  that 
they  use  this  method  indicated  that  they  provide  help  only  if  testees  have 
difficulty  with  ambiguities  in  test  language  or  instructions.  In  a  true 
assist method, help is given to those individuals who cannot perform a
particular item for whatever reason. Such a method is often used in cases
where the testee could not otherwise complete the test (e.g., a checkout
procedure).

Less  than  half  of  the  100  subjects  queried  said  that  they  used  go-no  go 
scoring  standards  on  their  tests.  This  does  not  imply  that  more  than  half  of 
the individuals in our survey necessarily use normative scoring standards;
instead,  many  use  point  scales  for  scoring. 

Over 70% responded that trainees who fail a test the first time are
retested. There are many cases where retesting is done. For example, in BCT,
AIT, and other hands-on performance testing situations, trainees are often
given  second  and  third  chances  to  pass  particular  performance  items. 

Uses  of  Test  Results.  The  primary  use  of  test  results  is,  of  course, 
to  evaluate  individual  performance.  This  is  true  whether  the  test  is 
criterion-referenced  or  normatively  based.  There  are,  however,  other  ways 
in which test results can be used. Table 4 presents a summary of responses
to protocol items dealing with various uses of test results. Table 4 shows
that the most common uses of test results, other than for evaluation of
trainee  performance,  are  for  improving  training  and  for  diagnosis.  Test 
results  can  diagnose  areas  in  which  an  individual  is  weak  and  in  need  of 
remediation.  Seventy-two  percent  of  the  subjects  questioned  indicated  that 
they  use  test  results  for  diagnostic  purposes.  Diagnosis  is  usually  done 
informally:  Instructors  review  test  results  and  then  confer  with  trainees. 

Test results can also be used to assess course adequacy in the formative
evaluation  sense.  Seventy-three  percent  of  the  subjects  questioned  indicated 
that  they  use  feedback  from  the  tests  to  improve  courses.  The  way  in  which 
this  feedback  is  used  varies  widely.  For  example,  some  senior  instructors 
indicated  that  if  many  trainees  from  a  particular  instructor's  class  perform 
poorly  on  certain  parts  of  a  test,  they  would  first  evaluate  the  instructor. 

If  several  classes  taught  by  different  instructors  scored  poorly  on  a  section 
of  a  test,  the  senior  instructor  might  review  the  materials  used  in  that 
portion  of  the  course.  In  other  situations,  the  test  itself  is  reviewed 
using  feedback  from  the  students.  For  example,  if  a  test  item  is  unclearly 
worded  or  if  the  performance  called  for  is  unclear,  student  feedback  is  a 
valuable  tool. 
Table 4


USE  OF  TEST  RESULTS  OTHER  THAN  EVALUATING  INDIVIDUAL  PERFORMANCE: 
SUMMARY  OF  RESPONSES  ACROSS  ALL  POSTS 


Item
No. 

Brief Statement of Item

Number  of 
Subjects 
Responding 
to  Item 

Percent 

of 

“Yes" 

Responses 

1  h 

Do  you  use  test  results  to  compare 
trainees? 

>1 

0  5 

1 

Do  you  use  test  feedback  to  improve 
courses? 

> 

i  ' 

le 

Do  you  use  test  results  for  diagnostic 
purposes? 

[>■) 

. .. ) 

i  •- 

28 

Are  you  familiar  with  team  performance 
testing? 

For  complete  wording  of  the  protocol  Items,  see  Appendix  A 


Less  than  two-thirds  of  the  subjects  questioned  indicated  that  test 
results  are  used  to  compare  trainees.  Comparing  individuals  on  the  basis  of 
test results is essentially norm-referenced. It is possible, however, to
employ CRTs for norm-referenced purposes. In BCT, for instance, trainees
who pass the comprehensive performance test on their first try might be
considered for promotion from E1 to E2, while those who do not may not be
so  considered. 
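
The difference between the two uses of a score can be illustrated with a
short sketch. The trainee names, the scores, and the passing standard of 80
are hypothetical and are not taken from the survey. The first function
compares each trainee with a fixed external standard (the criterion-referenced
use); the second ranks trainees against one another (the norm-referenced use).

```python
def criterion_referenced(scores, standard):
    """Compare each trainee with a fixed external standard (go / no go)."""
    return {name: ("go" if score >= standard else "no go")
            for name, score in scores.items()}


def norm_referenced(scores):
    """Rank trainees against one another (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Hypothetical scores, for illustration only.
scores = {"Adams": 92, "Baker": 85, "Clark": 78}
print(criterion_referenced(scores, standard=80))
print(norm_referenced(scores))
```

The same set of scores supports both interpretations; what differs is
whether the reference point is the standard or the other trainees.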

Considerably less than half of the subjects questioned said that they
were familiar with team performance testing situations. Further, of those
who indicated familiarity with the concept, many indicated that team
performance testing is often individual evaluation in a team context. Actually,
the  testing  of  team  performance  was  very  limited  on  the  Army  posts  visited. 

Types of Tests. Table 5 shows a description of types of tests constructed
or used by subjects in our survey sample, based upon their responses to
protocol item 27. Part 1 is a categorization according to test mode, Part 2
according to test use. For both parts, subjects were asked to indicate the
approximate percentage of each type of test with which they were involved.
Table 5

TYPES OF TESTS CONSTRUCTED OR USED:
SUMMARY OF RESPONSES TO PROTOCOL ITEM 27 ACROSS ALL POSTS


Item 27 - Part 1
N = 93

What proportion of the tests you have participated in making or using
are:
                                                      Mean Response

A.  Paper-and-pencil knowledge tests?                 [illegible]
B.  Simulated performance tests? (e.g., using
    mockups and drawings)                             7.9%
C.  "Hands-on" performance tests?                     [illegible]
D.  Other?                                            9.9%

                                          Total:      100%


Item 27 - Part 2
N = 73

What proportion of the tests you have participated in making or using are
for:
                                                      Mean Response

A.  Specific skill and knowledge requirements?        39.[illegible]%
B.  Specialty areas in a course?                      [illegible]
C.  End of block within a course?                     30.0%
D.  Mid cycle within a course?                        6.9%
E.  End of course?                                    [illegible]

                                          Total:      100%
It appears that most tests are either paper-and-pencil knowledge
tests or hands-on performance tests. Although Table 5 indicates that
paper-and-pencil knowledge tests are nearly [illegible]% of those created and
used, many subjects confused paper-and-pencil knowledge tests with
paper-and-pencil performance tests. This was learned from discussions with
interviewees. In many areas, paper-and-pencil tests are equivalent to
the performance called for in the actual task situation. For example,
such diverse areas as map-making and aiming artillery require
paper-and-pencil performance. Maps must be drawn to scale, while in many cases
the aiming of artillery requires mathematical computations. It is estimated
that about half of the responses in the paper-and-pencil knowledge test
category actually referred to paper-and-pencil performance testing. Thus,
responses to Part 1 of Item 27 can be interpreted to indicate that nearly
three-quarters of the tests constructed or used are performance tests of
one sort or another. These results accord with the emphasis on performance
testing, and indicate that performance testing has become widespread
in many phases of Army evaluation.

Responses  to  Part  2  indicate  that  tests  measuring  specific  skill  and 
knowledge  requirements,  and  those  used  at  ends  of  blocks  of  instruction, 
account for about 70% of test construction and use. Mid-cycle tests and
end-of-course  tests  together  account  for  less  than  one-quarter  of  the 
tests.  Responses  to  Part  2  of  Item  27  indicate  that  tests  are  well 
distributed  throughout  instruction.  This  is  good  news  since  frequent 
testing  can  provide  frequent  feedback  and  the  possibility  for  on-going 
remediation. 

Problems. Table 6 presents a summary of responses to protocol items
dealing with problems in the development and use of CRTs. Over two-thirds
of the subjects (who were primarily supervisory personnel for this item)
indicated that increased expense may be a problem in the development and
use of CRTs. Several subjects commented that the extra expense may be a
factor in reducing the availability of CRTs in the Army. However, many
individuals indicated that increased expense is a short-term factor, and
that in the long run, criterion-referenced testing is less expensive than
is norm-referenced testing. Criterion-referenced testing is presumably
less costly in terms of insuring the efficient output of well-trained
soldiers.

Many individuals in the survey sample felt that time pressures, or
other constraints, often prevent successful construction and use of tests.
In discussion, subjects indicated that time pressure is the most common
constraint, and that time pressures are usually present in test development.
Table 6

GENERAL PROBLEMS IN THE DEVELOPMENT AND USE
OF CRITERION-REFERENCED TESTS:
SUMMARY OF RESPONSES ACROSS ALL POSTS

Item                                           Number of Subjects   Percent of
No.   Brief Statement of Item (a)              Responding to Item   "Yes" Responses

30    Have time pressures, or other            89                   61
      constraints, prevented successful
      test construction and use?
31    Have you seen tests which were           [illegible]          [illegible]
      unsuitable for their intended uses?
[missing]
      Are Criterion-Referenced Tests more      49                   [illegible]
      expensive to develop and use than
      norm-referenced tests?

(a) For complete wording of the protocol items, see Appendix A.


However,  time  pressures  and  other  constraints  do  not  usually  interfere  with 
test administration tasks. Usually, tests are administered satisfactorily
despite time pressures. Interviewees seemed to think that Army test
development and administration have improved greatly in recent years.

Attitudes.  Table  7  presents  a  summary  of  subject  attitudes  concerning 
criterion-referenced  testing  in  the  Army.  In  general,  subjects  were  in 
favor  of  the  Army  trend  toward  criterion-referenced  testing.  Comments 
included: "Criterion-referenced testing is the best system of testing yet
devised";  "It  is  the  only  way  to  go";  "It  is  a  terrific  improvement  over 
testing  in  the  old  Army";  "Criterion-referenced  testing  should  be  used 
exclusively  in  the  Army  and  wherever  else  possible,  including  civilian 
educational  institutions."  Eighty-eight  percent  of  the  individuals 
responding  felt  that  criterion-referenced  testing  should  receive  high  or 
top  priority  in  terms  of  Army  assessment  programs.  Sixty  percent  felt  that 
criterion-referenced  tests  should  replace  most  or  all  norm-referenced  tests. 

Subjects  felt  that  criterion-referenced  testing  is  practical  and  useful 
in  measuring  job  performance  skills.  No  other  item  on  the  survey  protocol 
elicited a 100% positive response. In addition, many individuals felt that
criterion-referenced  testing  would  be  useful  and  practical  for  measuring 
Table 7

ATTITUDES CONCERNING CRITERION-REFERENCED TESTING:
SUMMARY OF RESPONSES TO PROTOCOL
ITEMS 34 AND 40 ACROSS ALL POSTS


Item 34

How strongly do you feel about future use of Criterion-Referenced Testing
in the Army? Should Criterion-Referenced Test development receive high
or low priority in terms of Army assessment programs?

N = [illegible]

Percent Responding
to Each Alternative

[illegible]  Strongly against -- Criterion-Referenced Testing should
             receive bottom priority, or be dropped entirely.

[illegible]  Against -- Criterion-Referenced Testing should receive
             low priority.

[illegible]  Neutral -- Criterion-Referenced Testing should receive
             average priority.

[illegible]  For -- Criterion-Referenced Testing should receive high
             priority.

60           Strongly for -- Criterion-Referenced Testing should
             receive top priority. Criterion-Referenced Tests
             should replace most or all norm-referenced tests.

Total:  100%


Item 40

Do you feel that Criterion-Referenced Testing is practical and useful in
measuring job performance skills?

Number of Interviewees Responding = 64
Percent responding "yes" = 100


areas  other  than  job  performance  skills.  Knowledge  tests,  for  example, 
were seen by many as a practical and useful application of the
criterion-referenced concept.

DISCUSSION  OF  CRT  SURVEY 

Over [illegible] hours of interviews were conducted during the survey of
criterion-referenced  testing  in  the  Army.  Topics  covered  ranged  from  the 
extent,  utility,  and  practicality  of  CRT  use  in  the  Army,  to  problems  in 
implementing  CRTs. 

Although  criterion-referenced  testing  is  used  in  today's  Army,  many 
NRTs  are  in  use  also.  This  is  not  surprising,  since  criterion-referenced 
testing  is  a  relatively  new  concept.  It  was  apparent  from  the  survey, 
however,  that  CRT  use  is  increasing. 

At  each  installation  visited,  criterion-referenced  testing  was  in 
evidence. The combat arms schools visited (Infantry, Armor, Artillery, and
Air Defense) develop and use a number of CRTs. However, school implementation
of criterion-referenced testing is in the beginning stages. Some
departments  are  making  serious  attempts  to  incorporate  CRTs,  while  others 
are  only  minimally  involved.  Many  employ  criterion-referenced  terminology, 
but  do  not  produce  true  CRTs.  This  is  especially  true  in  "soft  skill" 
areas,  such  as  tactics  and  leadership.  Most  academic  departments  within 
these  four  combat  arms  schools  indicated  that  many  of  their  tests,  especially 
the  written  ones,  are  graded  on  a  curve.  Much  reliance  appears  to  be  placed 
upon  subjectively  graded  paper-and-pencil  tests  and  upon  computer-graded 
objective  tests. 

MOS  testing  continues  to  be  primarily  norm-referenced.  Most,  if  not  all, 
MOS tests rely on situational multiple-choice items. Because of the low
fidelity of such items, it is often difficult to determine if they are
criterion- or norm-referenced. On the surface, at least, they are
suspiciously similar to conventional knowledge test questions.

The CRT concept is being considered for Training Extension Course (TEC)
packages. The optional "audio-only" performance test appended to
such  TEC  packages  requires  further  development  and  implementation  so  that 
TEC  instruction  can  be  more  thoroughly  evaluated  in  a  criterion-referenced 
fashion. 

At  Fort  Ord,  California,  CRTs  are  employed  both  in  BCT  and  in  AIT. 
Although there are problems in the administration of the Comprehensive
Performance Tests (a type of CRT used toward the end of basic training), the
testing  experiences  at  Fort  Ord  should  be  able  to  serve  as  a  good  "field 
laboratory"  for  developing  CRT  applications. 
AIT in diverse areas such as field wiring and food services appears to
be benefiting from the use of CRTs. Preliminary indications are that more
soldiers are being evaluated more effectively through application of
criterion-referenced testing. Further, instructors, supervisors, and
students all appear to be favorably disposed toward CRTs.

In  general,  although  criterion-referenced  testing  is  not  extensive,  there 
are  many  instances  of  serious  attempts  at  CRT  development  and  use  at  the 
Army  installations  visited. 2  There  was  much  respect  for  the  utility  and 
practicality  of  criterion-referenced  testing.  As  noted,  many  interviewees 
were  strongly  in  favor  of  increased  use  of  criterion-referenced  testing  in 
the Army. Many who had experience with developing or using such tests
indicated  increased  evaluation  effectiveness,  increased  individual  morale 
and,  in  the  long  run,  reduced  expense  as  a  function  of  CRTs.  Despite  this 
high  regard,  there  was  too  little  rigorous  development  or  application  of 
CRTs.  While  progress  is  being  made  toward  achieving  rigor  in  "hard  skill" 
areas, especially in equipment-related skills, attempts in "soft skill"
areas are lacking. Personnel who develop tests for such areas in many cases
are attempting to develop CRTs, but are diverted at the outset since genuine
difficulties in specifying objectives explicitly are often encountered.

The  survey  revealed  virtually  no  evidence  of  criterion-referenced  testing 
in  team  performance  situations.  In  fact,  as  many  subjects  pointed  out, 
operational units are not formed until after AIT. This does not mean,
however, that CRT development for unit performance is inappropriate. Such
tests could be developed and used in AIT and then exported to field units.
Although problems may occur when an individual begins to work within a field
unit, this is not an argument against unit CRTs.

The  CRT  Construction  Manual.  Subjects  at  all  levels  indicated  a  need 
for increased development and use of criterion-referenced testing in the Army.
Many indicated the need for guidance in constructing and in administering
CRTs. A consensus indicated that such guidance should be written in simple,
straightforward language and should address criterion-referenced testing in
a non-theoretical, practical manner. Individuals interviewed in the survey
indicated that a manual of this type would be well received at all levels
in test development and evaluation units.


2 

Many of the personnel interviewed confused CRTs with "hands-on" performance testing. In terms of
implementing hands-on performance testing programs, the trend at the Army posts visited is dramatic;
many such tests are in evidence. Not all of these tests are criterion-referenced, however.

For a test to be called criterion-referenced, an individual testee's skills or knowledges must be compared to
some external standard. This means that test items must be matched to objectives which are derived from
valid performance data. This is not the case for a significant proportion of the "hands-on" performance
tests presently used at the sites surveyed.
Development Process. A number of difficulties in CRT development
and use were observed or described during the survey. First, CRTs must be
derived from well-specified objectives which are, in turn, the results of
careful task analyses. Unfortunately, task analysis data are not available
in many cases, and in cases where they are available, they are often
disregarded. Many test developers write statements of performance standards
from Plans of Instruction (POIs) or from Army Subject Schedules. In most
cases, these POIs and schedules are based upon task analyses. However, often
the critical source data are not readily apparent. In other cases,
objectives are defined "out of the blue" by subject matter experts who may be
unfamiliar with the instructional system development process. Worse yet, in
some cases careful task analyses have been developed and then ignored. For
example, in one AIT course visited, a careful task analysis had been
conducted which accurately documented critical behaviors. Although the
performance tests used in the course were developed from objectives derived
from the task analysis, the recently revised subject schedule ignored, and in
some cases flatly contradicted, the task analytic data. As a result, the
revised subject schedule required testing skills that the task analysis had
revealed are performed very infrequently, but did not mention other skills
which, according to the task analysis, were most frequently performed.

Many  difficulties  in  CRT  development  can  be  overcome  if  task-analytic 
data  are  actually  used  in  the  development  of  tests.  When  tests  are  modified 
for  local  administration,  those  responsible  for  the  modification  should  have 
access  to  the  same  task-analysis  data. 

Practical Constraints. The CRT survey suggested that priorities and
practical constraints for task objectives are usually assessed informally.
If task priorities are not accurately assessed and defined, the development
of test items which measure the achievement of objectives is exceedingly
difficult.  If  all  objectives  are  taken  to  be  of  equal  weight,  then  they  will 
normally  be  assessed  by  an  equal  number  of  test  items  when,  in  fact,  more 
important  objectives  may  require  more  thorough  testing. 
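
One way to let priorities govern the thoroughness of testing is to allot
items to objectives in proportion to their priority weights. The sketch below
is only an illustration under that assumption: the objective names and
weights are hypothetical, and the survey does not prescribe this or any other
allocation rule. Fractional shares are rounded by the largest-remainder
method so that the counts always add up to the planned test length.

```python
def allocate_items(weights, total_items):
    """Split total_items among objectives in proportion to priority weight.

    weights maps objective name -> positive priority weight. Rounding uses
    the largest-remainder method so the counts sum exactly to total_items.
    """
    total_weight = sum(weights.values())
    exact = {obj: total_items * w / total_weight for obj, w in weights.items()}
    counts = {obj: int(share) for obj, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Hand the remaining items to the objectives with the largest fractions.
    by_remainder = sorted(exact, key=lambda obj: exact[obj] - counts[obj],
                          reverse=True)
    for obj in by_remainder[:leftover]:
        counts[obj] += 1
    return counts


# Hypothetical objectives and weights, for illustration only.
weights = {"splice field wire": 3, "install telephone": 2, "recover wire": 1}
print(allocate_items(weights, total_items=10))
```

With equal weights the function reduces to the equal-allocation case
described above.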

Frequently,  practical  constraints  to  the  testing  situation  are  considered 
only  as  an  afterthought.  Constraints  which  operate  in  the  testing  situation 
should rightfully be considered while a test is being developed. Some
Soldier's Manual Army Testing (SMART) books, for example, show a minimal
regard for practical testing constraints. They contain lengthy checklists
which, although possibly of use in evaluating an individual's performance,
cannot be followed by test administrators. In some cases, one tester may
administer a SMART test to many soldiers simultaneously, although totally
unable to observe all items on the SMART checklist. Thus, at a given
testing station, a particular soldier may be scored as a "no go" while another
soldier may be scored "go" because the tester could only observe one
accurately. The problem of including practical testing constraints and
task priorities can be solved by training test developers to consider these
as an integral part of the test development process.


Item Pools. Test developers seem to have little difficulty creating
items if the performances, standards, and conditions are accurately
specified. However, many Army test developers surveyed indicated that they
wrote only the precise number of items required for a specific test. These
items are typically reviewed by subject matter experts and are then revised
accordingly. If alternate forms of a test are required, a pool of items
is constructed such that a computer can format alternate test forms by
selecting  a  subset  of  items  from  the  pool.  Rarely  are  extra  items  written. 
Accordingly,  there  is  no  empirical  selection  process  for  final  test  items. 
Items  are  typically  dropped  or  revised,  after  a  review,  if  large  numbers 
of  individuals  in  a  class  answer  them  incorrectly. 
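
The computer selection of alternate forms from a pool can be sketched as
follows. This is a minimal illustration, not a description of any Army
system: the pool layout (a list of records with "id" and "objective" fields),
the function name, and the sample data are all assumptions. Items are drawn
at random within each objective so that every form covers the same
objectives with the same number of items.

```python
import random


def build_form(pool, items_per_objective, seed=None):
    """Draw one test form from an item pool.

    pool is a list of dicts with "id" and "objective" keys;
    items_per_objective maps objective -> number of items wanted.
    """
    rng = random.Random(seed)
    form = []
    for objective, wanted in items_per_objective.items():
        candidates = [item for item in pool if item["objective"] == objective]
        if len(candidates) < wanted:
            raise ValueError(f"pool has too few items for {objective!r}")
        form.extend(rng.sample(candidates, wanted))
    return form


# Hypothetical pool: four items for each of two objectives.
pool = [{"id": i, "objective": "A" if i < 4 else "B"} for i in range(8)]
form_1 = build_form(pool, {"A": 2, "B": 2}, seed=1)
form_2 = build_form(pool, {"A": 2, "B": 2}, seed=2)
print([item["id"] for item in form_1], [item["id"] for item in form_2])
```

A pool that holds exactly as many items as one form needs yields the same
item set on every form, which is the limitation noted above.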

Creating  a  test  item  pool  should  become  a  standard  part  of  the  test 
development  process.  If  twice  as  many  items  are  developed  as  are  needed 
for  a  specific  test,  the  test  can  be  tried  out  and  the  final  items  selected 
empirically.  An  empirical  item  analysis  strategy  should  be  incorporated  to 
select  final  test  items.  Although  the  creation  of  item  pools  and  the  use  of 
item analysis techniques may introduce added expense into the test
development procedure, the payoff should outweigh the expense. The payoff here
is the development of items that are feasible and which reliably address
appropriate criterion behaviors.
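
An empirical selection of this kind might look like the sketch below. The
report does not name a particular item analysis technique; the index used
here, the difference between the proportion of instructed and uninstructed
trainees who pass each item, is one possibility chosen for illustration, and
the tryout data are hypothetical. Items passed by instructed trainees and
failed by uninstructed trainees receive the highest values and are kept.

```python
def proportion_passing(responses, item):
    """Fraction of trainees who passed the item (responses hold 1 or 0)."""
    return sum(r[item] for r in responses) / len(responses)


def discrimination(instructed, uninstructed, item):
    """Instructed-minus-uninstructed difference in proportion passing."""
    return (proportion_passing(instructed, item)
            - proportion_passing(uninstructed, item))


def select_items(instructed, uninstructed, n_keep):
    """Keep the n_keep items that best separate the two tryout groups."""
    items = list(instructed[0].keys())
    ranked = sorted(
        items,
        key=lambda item: discrimination(instructed, uninstructed, item),
        reverse=True,
    )
    return ranked[:n_keep]


# Hypothetical tryout: four items written, two needed on the final test.
instructed = [{"q1": 1, "q2": 1, "q3": 1, "q4": 0},
              {"q1": 1, "q2": 1, "q3": 0, "q4": 0}]
uninstructed = [{"q1": 0, "q2": 1, "q3": 0, "q4": 0},
                {"q1": 0, "q2": 1, "q3": 0, "q4": 1}]
print(select_items(instructed, uninstructed, n_keep=2))
```

In this made-up tryout, an item that everyone passes regardless of
instruction (q2) is dropped even though no one answers it incorrectly.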

Reliability  and  Validity.  A  major  omission  in  the  development  of  CRTs, 
as  observed  during  the  Army  survey,  is  the  lack  of  test  evaluation.  There 
was  virtually  no  consideration  of  test  reliability  and/or  validity.  This 
does  not  indicate  that  the  tests  as  developed  are  unreliable,  but  that  the 
question  has  not  been  addressed.  A  few  subjects  did  indicate  that  content 
validity  had  been  considered  by  virtue  of  careful  matching  of  test  items 
and task objectives. Content validity, however, is not necessarily the only
type  of  validity  appropriate  for  CRTs.  Predictive  validity  can  also  be 
assessed.  That  is,  trainees  can  be  tested  using  CRTs  and  then  evaluated 
under  field  conditions  performing  the  tasks  for  which  they  have  been  trained. 
Test results for a valid test should be congruent with later field
performance results.
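
The congruence between test results and later field results can be
summarized in a single number. The sketch below assumes that both the CRT
and the field evaluation are scored "go" or "no go" and uses the phi
coefficient for the resulting two-by-two table; the choice of statistic and
the data are illustrative assumptions, not part of the survey.

```python
from math import sqrt


def phi_coefficient(test_go, field_go):
    """Phi coefficient between two go / no go classifications.

    test_go and field_go are equal-length sequences of booleans, one pair
    per trainee. Returns a value from -1 (complete disagreement) to +1
    (complete agreement).
    """
    if len(test_go) != len(field_go):
        raise ValueError("both sequences must cover the same trainees")
    pairs = list(zip(test_go, field_go))
    a = sum(1 for t, f in pairs if t and f)          # go on both
    b = sum(1 for t, f in pairs if t and not f)      # go on test only
    c = sum(1 for t, f in pairs if not t and f)      # go in field only
    d = sum(1 for t, f in pairs if not t and not f)  # no go on both
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column is empty")
    return (a * d - b * c) / denominator


# Hypothetical results for six trainees.
test_go = [True, True, True, False, False, True]
field_go = [True, True, False, False, False, True]
print(round(phi_coefficient(test_go, field_go), 2))
```

A value near zero would indicate that passing the test says little about
later field performance, whatever the content validity of the items.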

Army  test  developers  should  be  instructed  in  techniques  for  establishing 
reliability  and  validity  of  CRTs.  Even  if  a  test  evidences  content  validity 
as  a  function  of  careful  creation  based  upon  task  objectives,  reliability  is 
still  in  question.  If  a  test  cannot  be  administered  reliably,  results  are 
meaningless. 
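
For go / no go tests, one simple reliability check is to administer the
test (or equivalent forms) twice and see how often the same decision is
reached. The sketch below computes the proportion of consistent decisions
and Cohen's kappa, which corrects that proportion for chance agreement. The
technique is offered as an assumption about how such a check could be done;
the report does not specify a procedure, and the data are hypothetical.

```python
def decision_consistency(first, second):
    """Agreement between two go / no go administrations.

    first and second are equal-length sequences of booleans (True = go).
    Returns (proportion of consistent decisions, Cohen's kappa).
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    agreement = sum(1 for x, y in zip(first, second) if x == y) / n
    p_first = sum(1 for x in first if x) / n    # proportion "go", first time
    p_second = sum(1 for y in second if y) / n  # proportion "go", second time
    chance = p_first * p_second + (1 - p_first) * (1 - p_second)
    if chance == 1:
        raise ValueError("kappa is undefined when all decisions are identical")
    kappa = (agreement - chance) / (1 - chance)
    return agreement, kappa


# Hypothetical decisions for four trainees on two administrations.
print(decision_consistency([True, True, True, False],
                           [True, True, False, False]))
```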

Administration.  A  poorly  administered  test  defeats  long  hours  of  careful 
test development. The CRT survey indicated that a lack of standardized
testing conditions exists in many areas. This is in part attributable to
lack  of  training  in  test  administration  for  testers,  and  in  part  to  lack  of 
clearly  defined  test  administration  instructions. 

One administrative problem observed was that soldiers may be aided or
hindered as a function of their position in the performance testing line.
Those who are not first in line "get a break" by observing mistakes of others.
The test administrative conditions should specify that trainees waiting
to be tested remain at a certain distance from the test site, or the test
administrators should be instructed in conducting such tests in a
standardized manner, or both.

Careful  instructions  in  test  administration  are  necessary  to  insure 
accurate testing. Steps should be taken to insure that test administration
practices  are  clearly  defined  for  each  test,  and  that  test  administrators 
are  adequately  trained.  Further,  test  sites  should  be  regularly  inspected 
to  insure  that  tests  are  being  given  under  the  specified  standard  conditions. 
APPENDIX A


INTERVIEW PROTOCOL:

SURVEY OF CRITERION-REFERENCED TESTING IN THE ARMY


* = Optional question: Ask as appropriate


Name  of  Interviewee: 


Mailing  Address: 


Telephone  Number: 

Introduction.  Interviewer  will: 

A.  Introduce  himself 

B.  Introduce  ASA 

C.  Explain  that  ASA  is  doing  contract  work  for  the  Army  Research  Institute 

D.  State  that  ASA  is  interested  in  improving  tests  for  the  Army 

E.  Explain  that  ASA  wants  to  find  out  about  current  status  of  testing  in 
the  Army  so  we  can  determine  what  we  can  build  on 


1. What is your position in the organization here?

What  school  or  center  are  you  in?  _ 

What  is  your  directorate,  department, 
or  unit?  _ 

What  is  your  branch  or  section?  __ ___ 

What  is  your  position  and  title?  _ 


2. How long have you been involved in testing? Years _ Months _

3. What did you do before you became involved in testing?


Interviewer  Statement:  Now,  I  would  like  to  discuss  with  you,  some  tasks 
that  may  be  involved  in  test  construction  and  use.  These  tasks  are  done  in 
different ways in different places. Sometimes they are combined, in other
cases  some  are  eliminated.  They  often  go  by  different  names.  Would  you 
please  tell  me  which  of  these  you  are  involved  in. 

* 4. Writing objectives. That is--determining what the test will measure and
the conditions under which the measurement will occur in terms of
precise, behavioral statements.

Have you been involved in writing objectives? Yes _ No _

If yes, (a) how long have you been doing this? Years _ Months _

(b) do you write objectives in operational, behavioral terms?

Yes _ No _ Don't understand _

* 5. Setting standards. That is--defining the standards against which
performance is evaluated. In many cases, these standards are very similar
to the stated objectives.

Have you participated in setting standards? Yes _ No _

If  yes,  how  long  have  you  been  doing  this?  Years _  Months _ 

* 6. Imposing practical constraints. That is--deciding how the test must be
built so it can actually be used within the limits of the situation for
which it is designed. For example, there are often time constraints
involved in testing complex skills.

Have you been involved in this? Yes _ No _

If yes, how long have you been doing this? Years _ Months _

*  7.  Determining  priorities.  That  is — deciding  how  important  each  standard  is 

in  relation  to  other  standards. 

Have  you  helped  determine  priorities?  Yes  No 

If  yes,  how  long  have  you  been  involved  in  determining  priorities? 

Years  Months _ 

*  8.  Writing  items.  That  is — creating  items  for  use  in  the  test. 

Have  you  written,  or  helped  to  write  items?  Yes  No 

If  yes,  (a)  how  long  have  you  been  involved  in  writing  items? 

Years _  Months 

(b)  does  your  group  of  items  usually  contain  more  than  will  be 
included  in  the  test?  Yes  No _  Don't  know 



* 9. Selecting final test items. That is--applying statistical tests to
determine the most useful, non-redundant items.

Have you been involved in selecting final test items? Yes _ No _

If yes, (a) for how long have you done such work? Years _ Months _

(b) do you use an item analysis technique?

Yes _ No _ Don't know _

* 10. Test administration. That is--administering the test in the situations
for which it was planned. Also, test administration is often done as a
try-out, before the test is finalized.

Have you participated in administering tests? Yes _ No _

If yes, (a) for how long have you done so? Years _ Months _

(b) have you ever found it appropriate to give help to someone
taking the test if they could not continue without help on
a particular item? Yes _ No _ Don't know _

* 11. Measuring reliability. That is--determining if a test will give similar
scores when measuring similar performance. For example, a person taking
equivalent versions of the same test should score about the same on both,
if he has had no practice in between.

Have you been involved in measuring the reliability of tests? Yes _ No _

If yes, (a) how long have you been involved in measuring reliability?

Years _ Months _

(b) do you compute coefficients of reliability?

Yes _ No _ Don't know _

* 12. Evaluating validity. The test developer must determine whether the test
is actually measuring what it is supposed to measure. Personnel who score
high on the test should also perform very well on the task that test is
supposed to measure, while those who score low should not be able to
perform the task as well.

Have you helped to validate tests? Yes _ No _

If yes, (a) how long have you been doing so? Years _ Months _

(b) do you use content validity as opposed to predictive validity?
Yes _ No _ Don't know _


13. Scoring. How are tests generally scored? Are norms set as standards
using bell-shaped curves, or are "go-no go" type standards used?

Norms _ go-no go _ Other _


To  what  uses  are  the  test  scores  put? 

14. One might be using test results to compare student performance.
Higher-scoring students might be considered for promotion, for example, while
those passing with a lower score might not be so considered.

Do you use test results to compare students?

Yes _ No _

If yes, (a) how long have you used test scores for comparisons?

Years _ Months _

(b) If a student doesn't get a passing score the first time, is
he tested again? Yes _ No _ Don't know _

15. Another use might be using test results to evaluate course adequacy.
Sometimes  the  results  of  tests  are  used  to  evaluate  the  success  of  a 
course.  Portions  of  a  test  that  many  students  fail  to  perform  well  on 
are  seen  as  reflecting  a  deficiency  in  the  corresponding  portion  of  a 
course.  Courses  can  then  be  improved,  using  test  results  as  feedback. 

Have  you  used  test  results  to  help  improve  courses?  Yes  No 

If  yes,  (a)  how  long  have  you  been  doing  so?  Years _  Months _ 

(b) when you do so, are test criteria based on task objectives,
rather than on course content? Yes _ No _ Don't know _

16. Another use might be using test scores to diagnose areas in which students
needed improvement.

Do you use tests for diagnostic purposes? Yes _ No _

If yes, how long have you been doing this? Years _ Months _

17. Are there other aspects of test development and use that you are aware of
but I did not mention? Yes _ No _

If  yes,  what  are  they? 


Interviewer  Statement:  Now  I  would  like  to  discuss  some  of  the  tasks  that 
you're  involved  in. 

19.  What  inputs  do  you  have  available  in  terms  of  documents,  data,  job  aids, 


20. Which of these inputs do you actually use?


*21. (If answer to 20 is other than "all of them", interviewer asks #21)
Why do you use these and not the others?


How are these outputs used?


24. What problems have you encountered?


Is  any  special  training  available  for  testing  personnel?  Yes _  No 

If  yes,  please  briefly  describe  this  training? 


What  proportion  of  the  tests  you  have  participated  in  making  or  using  are: 

A. Paper-and-pencil knowledge tests? _

B. Simulated performance tests? (e.g., using
mockups and drawings) _

C. "Hands-on" performance tests? _

D. Other? Specify: _


What  proportion  of  the  tests  you  have  participated  in  making  or  using  are 
for: 

A. Specific skill and knowledge requirements?  _

B.  Specialty  areas  in  a  course?  _ 

C.  End  of  block  within  a  course?  _ 

D.  Mid  cycle  within  a  course?  ____________ 

E.  End  of  course?  _ 

28. Are you familiar with any team performance situations that were evaluated
by tests?  Yes _  No _

29. Would you briefly describe how tests were used to measure team performance?


30. Have time pressures, or other constraints, prevented you from successfully
carrying out some of the tasks involved in test construction and use?

Yes _  No _

If yes, describe how you were affected by a constraint.


31. Can you describe any cases in which tests were developed which were not
suitable, in your opinion, for the intended uses?  Yes _  No _

Description:  _ _______ 


If  it  is  the  interviewer's  opinion  that  interviewee 
does  not  understand  the  distinction  between  Criterion- 
Referenced  Testing  and  norm-referenced  testing: 


STOP HERE

Otherwise  go  on. 


32. One of the main purposes of our work for the Army is to develop a manual
on how to construct Criterion-Referenced as opposed to Norm-Referenced
Tests.  Who will be the primary users of a manual of this type on this
post?


33. As you know, in recent years the Army has put increasing emphasis on using
Criterion-Referenced Tests in appropriate testing situations.  There is
still much disagreement, though, about what a Criterion-Referenced Test
really is.  How is the term "Criterion-Referenced Test" used on this post?


34.  How  strongly  do  you  feel  about  future  use  of  Criterion-Referenced  Testing 
in  the  Army?  Should  Criterion-Referenced  Test  development  receive  high 
or  low  priority  in  terms  of  Army  assessment  programs? 

_ Strongly against--Criterion-Referenced Testing should receive bottom
  priority, or be dropped entirely.

_ Against--Criterion-Referenced Testing should receive low priority.

_ Neutral--Criterion-Referenced Testing should receive average priority.

_ For--Criterion-Referenced Testing should receive high priority.

_ Strongly for--Criterion-Referenced Testing should receive top
  priority; Criterion-Referenced Tests should replace most or all
  norm-referenced tests.

35. Do you think cost is a major factor in determining whether Criterion-
Referenced Tests are developed and administered in the Army?  That is, have
you found that Criterion-Referenced Tests are more or less expensive to
develop and administer than conventional, norm-referenced tests?

Less expensive _  About the same _  More expensive _

36. Could you describe a situation in which a Criterion-Referenced Test was
found to be prohibitively expensive to develop?


37. Do you think that there are any particular advantages or disadvantages to
developing and using Criterion-Referenced Tests in the Army (as opposed
to norm-referenced measures)?  Yes _  No _

What  are  some  advantages  or  disadvantages? 


38. Are there any special problems you have encountered while developing or
using Criterion-Referenced Tests, as opposed to problems normally
encountered with norm-referenced tests?  Yes _  No _

If yes, describe these special problems and how you overcame them:


*39. How serious are these problems?  That is, how much do they affect the
overall accomplishment of testing objectives?


40. Do you feel that Criterion-Referenced Testing is practical and useful in
measuring job performance skills?  Yes _  No _

Why? _


*41. Are there other areas (such as knowledge tests and achievement tests) where
this concept could be useful?  Yes _  No _

Why? _


42.  What  should  we  include  to  make  the  manual  useful? 


APPENDIX  B 


SUMMARY  OF  TYPES  OF  PERSONNEL  INTERVIEWED  AT  ARMY  INSTALLATION 


Table B-1

FORT BENNING INTERVIEWEES


                            Directorate,
Classification Area         Department or Division                    Job Title of Interviewee

U.S. Army Infantry School   Directorate of Educational Technology     Deputy Director (S)**
                            Faculty Development Division              Chief (S)
                                                                      Senior Instructor (DU)*
                                                                      Instructor (DU)
                                                                      Instructor (DU)
                                                                      Instructor (DU)
                                                                      Student (DU)
                            Brigade & Battalion Operations
                              Department (BBOD)
                            Operations & Training Techniques          Chairman (S)
                            Tactics Group                             Test Officer (S)
                                                                      Project Officer (DU)
                            Combat Support Group                      Instructor (DU)
                                                                      Instructor (DU)
                                                                      Instructor (DU)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)


Table B-1 (continued)


                            Directorate,
Classification Area         Department or Division                    Job Title of Interviewee

U.S. Army Infantry School   Directorate of Instruction                Chief (S)
  (continued)               Evaluation Division                       Evaluation Staff (DU)
                            Curriculum Division                       Director of Instruction (S)
                            Office of Directorate of Doctrine
                              & Training
                            Task Analysis Division                    Chief (S)
                            Training Management Team                  Chief (S)
                            Office of Medical Staff & Operations
                            Instructional Division                    Chief (DU)
                                                                      Chairman, Resident Committee (DU)
                            Weapons Department
                            Mortar Committee                          Instructor (DU)
TEC Program                                                           Chief (S)
MOS Testing Program                                                   Chief (S)


Table B-2


FORT BLISS INTERVIEWEES


                            Directorate,
Classification Area         Department or Division                    Job Title of Interviewee

U.S. Army Air Defense       High Altitude Missile Department          Training Specialist (S)**
  School                                                              Chief Project Officer for Curriculum (S)
                                                                      Training Specialist (DU)
                            Missile Electronic & Control Systems      Technical Publications Editor (S)
                              Department                              Instructor (DU)
                            Command & Staff Department                Chief, Command & Leadership Division (S)
                                                                      Instructor (DU)
                                                                      Department Staff (DU)
                            Army-wide Training Support Division       Educational Specialist (DU)
                                                                      Educational Specialist (DU)
                                                                      Assistant Chief of Course Development (DU)
                            Low Altitude Air Defense Department       Instructor (DU)
                                                                      Instructor & Technical Writer (DU)
                                                                      Department Staff (DU)
                            Ballistic Missile Defense Department      Training Specialist (DU)
                                                                      Instructor (DU)
                            Deputy Commandant for Training &          Executive Officer (S)
                              Education                               Staff (S)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)


Classification  Area 


Job  Title  of  Interviewee 


U.S.  Army  Air  Defense 
School  (continued) 

TEC  Program 


Training  Center  Program 


Directorate, 
Department  or  Division 

Office  of  the  Commandant 


Education  Advisor 


( 


Training  Development 
Division 


Air  Defense  Artillery 
Training  Brigade 


Chief  of  the  Division  ( 

Chief  Project  Officer 

for  TEC  Production  ( 

Project  Officer  ( 

Project  Officer  ( 

Training  Coordinator  ( 


Table B-3


FORT SILL INTERVIEWEES


                            Directorate,
Classification Area         Department or Division                    Job Title of Interviewee

U.S. Army Field Artillery   Tactic Combined Arms Department           Chief, Associate Arms Division (S)**
  Training School                                                     Senior Instructor (DU)*
                            Gunnery Department                        Chief, Exam Branch (S)
                                                                      Instructor/Grader (DU)
                            Office of the Commandant                  Education Advisor (S)
                            Office of the Deputy Assistant            Educational Specialist (S)
                              Commandant for Training & Education     Educational Specialist (S)
                            Materiel & Maintenance Department         Chief, Cannon Division (S)
                                                                      Instructor (DU)
                            Target Acquisition Department             Supervisory Training Specialist (S)
                                                                      Instructor (DU)
                            Command, Leadership and Training          Senior Instructor (DU)
                              Department                              Senior Instructor (DU)
                            Communications/Electronics Department     Training Instructor (DU)
MOS Testing Program         Evaluation Brigade                        Chief, MOS Analysis (S)
Training Center Program     Advanced Individual Training Brigade      Officer in Charge (S)
                                                                      Senior Instructor (DU)
                                                                      Instructor in Charge of (DU)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)




Table B-3 (continued)


                            Directorate,
Classification Area         Department or Division                    Job Title of Interviewee

TEC Program                 Army-Wide Training Support Department     Chief of Department (S)


Table B-4


FORT KNOX INTERVIEWEES


                            Directorate,
Classification Area         Department or Division                    Job Title of Interviewee

U.S. Army Armor School      Directorate of Training                   Chief, Task Analysis Division (S)**
                                                                      Test Director, MOS Evaluations (S)
                            Leadership Department                     Instructor, System and Procedures Branch (DU)*
                            Army Wide Training Support                Chief, Development Division (S)
                            Directorate of Instruction                Chief, Instruction Technology Division (S)
                                                                      Instructor, Instruction Technology Division (DU)
                                                                      Educational Specialist, Evaluation Branch (S)
                                                                      Chief, Curriculum Branch (S)
                            C and S Department                        Chief, Cavalry Branch (DU/S)
                                                                      Senior Instructor, Small Unit Tactical Operations (DU)
                            Automotive Department                     Chief, Quality Control Branch (S)
                            Weapons Department                        Training Administrator (DU)
Training Center             Headquarters 1st AIT Brigade              S-3 1st AIT Brigade (S)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)


Table B-5


FORT ORD INTERVIEWEES


                            Directorate,
Classification Area         Department or Division                    Job Title of Interviewee

U.S. Army Training Center   Directorate of Plans and Training
                            Quality Control Branch                    Chief, Quality Control Branch (S)**
                                                                      Training Evaluator, Quality Control Branch (S)
                            Basic Combat Training Testing             Project Test Officer, Quality Control Branch (DU)*
                                                                      Instructor, Proficiency Test Branch (DU)
Basic Combat Training       Operations and Training                   Training Officer (S)
  Command (Prov)            Training Brigade                          Battalion Commander (S)
                                                                      Battalion Executive Officer (S)
                                                                      Company Commander (S)
                                                                      Company Commander (S)
                                                                      Officer-in-Charge, First Aid Committee Group (DU)
                                                                      Instructor, First Aid Committee Group (DU)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)


Classification


Directorate,
Department or Division


Job Title of Interviewee


Basic Combat
Training
(continued)


Noncommissioned Officer-in-Charge
of Individual Tactical Training (DU)


Advanced
Individual
Training


Field Wireman Division


Senior Drill Instructor (DU)

Drill Instructor (DU)

Chief, Field Wireman
Training Division (S)

Instructor, Field
Wireman Training
Division (DU)


Food Services Division


Supervisor, Food
Services Division (S)


Instructor, Food
Services Division

APPENDIX  C 


QUANTITATIVE  DATA  GATHERED  DURING  ARMY  CRT  SURVEY 
Fort Benning, Georgia


Columns: School Curriculum (S*, DU**); Training Center Curriculum (S, DU);
MOS Testing Branch (S, DU); TEC (Training Extension Course) (S, DU).
Each column gives # Resp. and % Yes.

Item No.

4.    Involved in writing objectives?                     3 100    2 88    1 100    1 100    1 100    1 100

4b.   Objectives operationally, behaviorally written?     1 0    6 67    1 100

5.    Participated in setting objectives?                 3 67    11 64    1 100    1 100    1 100    1 100

6.    Imposed practical constraints?                      3 67    12 83    1 100    1 100    1 100    1 100

7.    Helped determine priorities?                        2 50    12 58    1 100    1 0    1 100    1 0

8.    Did you write items?                                3 33    1? 100    1 100    1 100    1 100    1 100

8b.   Item pool?                                          3 0    11 73    1 100    1 100

9.    Involved in selecting final items?                  2 0    11 43    1 100    1 0    1 100    1 0

9b.   Use item analysis technique?                        1 0    8 25    1 100    1 0

10.   Participated in test administration?                3 66    12 100    1 0    1 100    1 100    1 100

10b.  Ever assisted someone taking test?                  2 50    12 *?    1 100

11.   Involved in measuring test reliability?             3 33    n 55    1 100    1 100    1 100    1 0

11b.  Compute coefficients of reliability?                2 0    4 50    1 100    1 100

12.   Aid in validating tests?                            3 0    11 76    1 100    1 100    1 100    1 0

12b.  Use of content validity?                            2 50    3 0    1 100

13.   Scoring: Norm or go-no-go?                          > bo    12 33    1 0    1 0    1 100    1 100

14.   Test results used to compare student performance?   3 66    1! 91    1 1™>    1 100    3 100    1 0

14b.  Retest?                                             2 100    7 28

15.   Feedback used to improve tests?                     4 30    12 50    1 0    1 100    1 100    1 100

16.   Tests used for diagnosis?                           3 33    12 23    1 100    1 100    1 100    1 0

17.   Aware of other aspects?                             2 50    3 40    1 0    1 0

26.   Training available for testing?                     2 30    9 77    1 100    1 0    1 100    1 100

27.   See following page for this item.

28.   Tests for team performance evaluation?              2 100    8 75    1 100    1 0    1 100    1 0

30.   Constraints restrictive to test development?        3 67    10 0

31.   Any tests unsuitable for intended uses?             1 100    4 50

34.   See following page for this item.

35.   Are CRTs more expensive than NRTs?                  73    3 67    1 0    1 100

40.   Criterion-referenced testing practical and useful?  1 100    9 100    1 100    1 100    1 100

* Supervisors of Test Development and/or Use.

** Developers and/or Users of Tests.


Fort Benning, Georgia


Item 27

Proportion of tests:

Respondent groups: School Curriculum (S*, DU**); Training Center Curriculum (S, DU);
MOS Testing Branch (S, DU); TEC (Training Extension Course) (S, DU)

            A. Paper &      B. Simulated      C. "Hands-On"
            Pencil Tests    Performance       Tests           D. Other
# Resp.     (%)             Tests (%)         (%)             (%)

   2         12.5            12.5              75               0
  11         73               0                27               0
   1        100               0                 0               0
   1         50               0                50               0
   1         10              20                70               0
   1         50              50                 0               0

Item 27, Part 2

Proportion of tests made or used for:

Respondent groups: School Curriculum (S, DU); Training Center Curriculum (S, DU);
MOS Testing Branch (S, DU); TEC (Training Extension Course) (S, DU)

            A. Specific Skill    B. Specialty     C. End of Block    D. Mid-Cycle
            & Knowledge          Areas in a       Within a           Within a        E. End of
# Resp.     Requirements (%)     Course (%)       Course (%)         Course (%)      Course (%)

   1         20                   20               20                 20              20
   9         11                   14               34                 11              30
   1         20                   20               20                 20              20

0

73

School S

Curriculum

DU


Training S

Center

Curriculum DU


Item 34

Strength of opinion about future of CRT in Army


Strongly
Against
%


0


Against  Neutral


30


For

%

30

30


Strongly

For


MOS S

Testing

Branch DU


TEC (Training S
Extension
Course) DU


* Supervisors of Test Development.
** Developers and/or Users of Tests.


0 

0 


0 

0 


0 

0 


0 

0 


0 

0 


0 

0 


100 

100 




Fort Bliss, Texas


Columns: School Curriculum (S*, DU**); Training Center Curriculum (S, DU);
MOS Testing Branch (S, DU); TEC (Training Extension Course) (S, DU).
Each column gives # Resp. and % Yes.

Item No.

4.    Involved in writing objectives?                     7 71    12 100    2 100    1 100    : l tin

4b.   Objectives operationally, behaviorally written?     3 67    11 K2    2 100

5.    Participated in setting objectives?                 6 83    12 100    2 50    1 100    2 l

6.    Imposed practical constraints?                      7 »    11 91    2 50    1 100    s 100

7.    Helped determine priorities?                        7 71    a 91    2 30    1 100    2 mo

8. 

Did you write items?

b 
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